Analysis of Texas Public School Spending and STAAR Performance

Project Overview

A1. Research Question

The research question this project sought to answer is whether we can predict if a Texas public school district will perform well or poorly on the STAAR test based on how it allocates or spends its budget, taking into account how racially diverse the district is. The project also sought to identify which financial expenditure items are important predictors of whether a district will perform well or poorly on the STAAR test. The purpose of this analysis is to provide counsel to school administrators and policymakers so that they can make data-driven decisions instead of relying on anecdotal evidence alone.

A2. Project Scope

This project's scope was to use a Jupyter notebook to build a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model takes several financial expenditure inputs and outputs a prediction of that district's STAAR performance. The project examined academic year 2018-2019.

A3. Solution Overview – Tools

The Jupyter notebook that contains the model was coded in the Python programming language and relied on several key inputs from the end user. The first input is a csv file called Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019 (Texas Education Agency, 2019), which contains demographic information on the white and non-white populations of each district. The second key input is tidy_campstaar1_2012to2019 and tidy_campstaar2_2012to2019 (Texas Education Agency, 2022), which contain information on how many students in each district performed well (meets or masters a subject) or poorly (approaches comprehension of a subject). The third key input is the 2007-2021-summaried-peims-financial-data (Texas Education Agency, 2022), which contains information on how each district spent its budget. The notebook then automatically performed cleansing and transformation of the data to make it suitable for analysis and modeling. Finally, the notebook constructed a model that predicted whether a district would perform well or poorly on the STAAR test.

A4. Solution Overview – Methodologies

There were four different types of methodologies used in this project: Project, Data Collection, Analytical, and Statistical. Each methodology played an important role in the planning, execution, and verification of the project.

Project Plan

In this project, the author executed the plan without change. All goals, objectives, and deliverables listed below were completed as described in Task 2, apart from the dates or time frames for each task. The timeline projected in Task 2 was very conservative, and the actual project finished earlier than planned.

B1. Project Execution

The goal for this project was to create a model that predicts how a district will perform on the STAAR test. The objectives of this goal are:

  • Determine which districts are performing or underperforming in relation to how much funding they receive, how racially diverse a district is, and how well the students in the district perform on the STAAR test overall.
  • Determine which type of expenditures contribute to the weighted performance of the district.
  • Predict how a district will perform based on how they allocate their budget.

B2. Project Planning Methodology

The project planning methodology used was PAPEM-DM (De Dios, 2020). The seven steps involved in this method are planning, acquisition, preparation, exploration, modeling, delivery, and maintenance.

  1. Project Planning – answered the why of the project, addressed the how, delivered a list of research questions to answer and criteria of success.
  2. Acquisition of Data – acquired the data from the TEA and Kaggle websites.
  3. Preparation of Data – prepared the data which includes cleaning the data and transforming it into a format suitable for analysis and modeling.
  4. Exploratory Data Analysis – explored the data using descriptive statistics and visualized the relationships between variables using matplotlib and seaborn. The deliverables of this phase were a few graphs ready for presentation to stakeholders.
  5. Modeling – created, fitted, and used the model for training and inference. The deliverable was a pickle file for the trained model.
  6. Delivery of Results – delivered the results by completing Task 3 of this capstone project.
  7. Maintenance – while important, this phase was not included in the project. However, the Jupyter notebook can serve as a basis for deployment into production (for inference) on any of the cloud technology platforms such as AWS, Azure, or GCP.

B3. Project Timeline and Milestones

Phase       | Description                  | Deliverable    | Start | End  | Duration
Planning    | Project planning             | Project plan   | 8/1   | 8/3  | 7 days
Acquisition | Getting the data             | Raw datasets   | 8/4   | 8/4  | 1 day
Preparation | Cleansing and transformation | Clean datasets | 8/5   | 8/21 | 13 days
Exploration | EDA                          | Visualizations | 8/5   | 8/21 | 13 days
Modeling    | Train and infer              | Pickle file    | 8/5   | 8/21 | 13 days
Delivery    | Task 3 report                | PDF document   | 8/22  | 8/28 | 7 days
Maintenance | Deployment to production     | Not applicable | N/A   | N/A  | N/A

Methodology

C. Data Selection and Collection Process

The author downloaded two of the datasets from the TEA website. While the third dataset could also have been downloaded from the same place, the author chose the Kaggle version of the data because it had already been collected in a format that did not require web scraping to mine the information from the website.

Data selection and collection did not vary much from the proposed plan. The only addition the author made was the ethnicity data downloaded from the same TEA website. This data was necessary to examine and analyze the effect of racial diversity within the districts.

One obstacle the author encountered involved that same dataset added during the execution of the project. The author ran into several errors when reading the csv file and realized that the formatting of the csv itself was off. Adjusting the read_csv parameter "skiprows" allowed the author to bypass the extra header rows and read the data needed for the analysis.
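For illustration, a hedged sketch of that read step follows; the filename matches the enrollment file named earlier, but the number of rows to skip (skiprows=4) is an assumption to be adapted to the actual file layout.

import pandas as pd

# skip the preamble rows above the real header in the TEA enrollment csv
# (skiprows=4 is illustrative; match it to the actual file layout)
enrollment = pd.read_csv(
    'Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019.csv',
    skiprows=4,
)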

There really were no data governance issues because the datasets are already in the public domain. The one concern would have been the anonymity of students in schools or districts so small that masking their identities would have been impossible. This issue was already taken care of by the Texas Education Agency before the datasets were published.

For example, for districts with small subpopulations, the numerator values have been coded as -999. The author simply had to replace those values with 0 for the dataframe to be processed cleanly. Another example of the datasets' cleanliness is that there were no missing values in any of the columns.
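As a toy sketch of that cleanup (the dataframe and column names here are illustrative, not the project's actual ones):

import pandas as pd

# replace the TEA's small-subpopulation placeholder (-999) with 0
district_df = pd.DataFrame({'district': [1, 2], 'meets_count': [350, -999]})
district_df = district_df.replace(-999, 0)
print(district_df)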

Therefore, cleanliness and sanitation were the obvious advantages of the datasets used. Their limitation is that, when merged, the data only covers the 2018-2019 school year. Since this is by design, it is not a big enough limitation to warrant much concern.

D. Data Extraction and Preparation Process

The author utilized pandas to read the csv files into dataframes. The read_csv() method was appropriate because the data is in a tabular format delimited by commas.

For data preparation, or data wrangling, the author used many techniques to handle missing and duplicate data, as well as to remove extraneous columns and rows. These techniques included replacing null placeholder values with true numeric NaN values, using drop_duplicates() and drop() for columns, and applying string methods to clean up and standardize the column names. The author also used loc, iloc, and np.where to filter columns and rows. Finally, for merging the datasets, the author used a combination of the concat() and merge() methods to execute inner and outer joins of the data. Utilizing pandas methods made sense because of their efficiency, versatility, and simplicity.
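A minimal, self-contained sketch of that wrangling flow follows; the column names and toy values are illustrative stand-ins for the real TEA and Kaggle files, not the project's actual code.

import numpy as np
import pandas as pd

# toy stand-ins for the demographic, STAAR, and financial extracts
demo = pd.DataFrame({'DISTRICT ID': [1, 2], 'Non White ': [120, -999]})
staar = pd.DataFrame({'district_id': [1, 2, 2], 'meets': [350, 410, 410]})
peims = pd.DataFrame({'district_id': [1, 2], 'payroll': [5.1e6, 7.3e6]})

# standardize column names and convert placeholder values to true NaN
demo.columns = demo.columns.str.strip().str.lower().str.replace(' ', '_')
demo = demo.replace(-999, np.nan)

# drop exact duplicate rows left over from the raw extract
staar = staar.drop_duplicates()

# inner-join the three sources on the district identifier
merged = demo.merge(staar, on='district_id').merge(peims, on='district_id')
print(merged)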

E. Data Analysis Process

For exploratory data analysis, the author used several visualizations to examine the relationships between variables. For example, the author used a heatmap to visualize the correlation coefficients in a correlation matrix computed with the Pearson method. A pairplot was also used to examine whether any of the relationships between variables had a positive or negative trend. Both were essential in determining and validating whether any of the variables suffered from multicollinearity. Knowing this is important because it determines which machine learning algorithms can or cannot be used if the author chooses not to handle the correlated variables individually. Finally, several violin plots were used to examine the distribution and density of each variable for both the passing and failing classes.
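The sketch below shows those three plot types with seaborn on a small synthetic dataframe; the column names and random values are placeholders rather than the project's actual data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# synthetic stand-in for the merged district-level dataset
rng = np.random.default_rng(493)
data = pd.DataFrame({
    'payroll': rng.normal(5e6, 1e6, 200),
    'operations': rng.normal(2e6, 5e5, 200),
    'pass_rate': rng.uniform(0.4, 0.95, 200),
    'passing_class': rng.integers(0, 2, 200),
})

# heatmap of Pearson correlation coefficients to spot multicollinearity
sns.heatmap(data.corr(method='pearson'), annot=True, cmap='coolwarm')
plt.show()

# pairplot to eyeball positive or negative trends between variables
sns.pairplot(data, hue='passing_class')
plt.show()

# violin plot of one expenditure variable split by passing/failing class
sns.violinplot(data=data, x='passing_class', y='payroll')
plt.show()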

To complement the visualizations, the author implemented a few statistical tests to analyze the dataset: Levene's test, the Shapiro-Wilk test, and the Mann-Whitney U test. The author used Levene's and Shapiro-Wilk to test for equality of variance and normality in order to decide between parametric and non-parametric methods of statistical testing. As a result, the author used the Mann-Whitney U test for significance.
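As a rough sketch with synthetic numbers (the lognormal samples below merely stand in for the expenditure figures), the variance and normality checks look like this in scipy:

import numpy as np
from scipy import stats

# synthetic expenditure samples for the two groups of districts
rng = np.random.default_rng(493)
well = rng.lognormal(mean=15, sigma=0.4, size=120)
poor = rng.lognormal(mean=15, sigma=0.6, size=120)

# Levene's test for equal variances and Shapiro-Wilk for normality;
# failing either assumption points toward a non-parametric test
print(stats.levene(well, poor))
print(stats.shapiro(well))
print(stats.shapiro(poor))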

However, statistical testing does have its own limitations. For one, qualitative factors of the study are not considered, because "statistical methods cannot study phenomena that cannot be expressed, captured, or measured in quantitative terms." For example, we considered the efficiency of districts while accounting for racial diversity, but one must be careful in assigning weights to the imbalance because it is not a simple black-and-white matter. One cannot simply assign a numerical value to racial disparity, and there is no clear-cut formula that defines racial inequity either.

In the author's opinion, visualization and statistical testing were both needed. Visualization, although limited to eyeballing, can reveal hidden patterns in the data, while statistical testing can validate the assumptions made by eyeballing said data.

Results

F. Project Success

F1. Statistical Significance

Calculating the statistical significance of this project was a reasonably straightforward task. First, the author defined a measure of efficiency with respect to racial diversity, called E, where E = (B/(P/S))/X: B is the total operating program budget of the district, P is the total number of passing (meets or masters a subject) students, S is the total number of students taking the test, and X is the ratio of non-white to white student enrollment. Based on this formula, the author calculated the first quartile of E and created two subsets of data based on that quartile. Districts at or below the quartile were treated as having performed well on the STAAR test, and districts above the quartile as having performed poorly. The author then tested for a statistical difference between the two group samples. Since the p-value is below 0.05, the difference between the two groups is statistically significant. With this, the author can reject the null hypothesis and support the original hypothesis that the "Payroll" expenditure (among other variables) of districts that performed well on the STAAR test is statistically different from that of districts that performed poorly.
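A condensed sketch of this procedure follows; the district values are synthetic and the column names are assumptions, but the efficiency formula and the quartile split mirror the description above.

import numpy as np
import pandas as pd
from scipy import stats

# synthetic district-level data standing in for the merged TEA dataset
rng = np.random.default_rng(493)
districts = pd.DataFrame({
    'budget': rng.normal(5e7, 1e7, 400),       # B: total operating program budget
    'passing': rng.integers(200, 900, 400),    # P: students who meet or master
    'tested': rng.integers(900, 1200, 400),    # S: students taking the test
    'diversity': rng.uniform(0.2, 4.0, 400),   # X: non-white to white enrollment ratio
    'payroll': rng.normal(3e7, 6e6, 400),
})

# efficiency with respect to racial diversity: E = (B / (P / S)) / X
districts['E'] = (districts['budget'] / (districts['passing'] / districts['tested'])) / districts['diversity']

# split on the first quartile of E: at or below = performed well, above = performed poorly
q1 = districts['E'].quantile(0.25)
performed_well = districts[districts['E'] <= q1]
performed_poorly = districts[districts['E'] > q1]

# non-parametric test for a difference in payroll spending between the groups
stat, p_value = stats.mannwhitneyu(performed_well['payroll'], performed_poorly['payroll'])
print(p_value)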

F2. Practical Significance

The practical significance of the difference in certain expenditures is that each district could save a significant amount of money. For example, if a school district decides to spend more on operations rather than on capital investments, it may realize a greater return on that investment in the form of higher student performance scores on the STAAR test.

F3. Overall Success

The author believes this project was a success. All three criteria laid out in Task 2 were met. First, the formula for calculating the efficiency of a district is appropriate and applicable. Second, the model has accuracy and AUC scores above 70%. And third, the project lists the most important features for high-performing districts.

G. Key Takeaways

This project set out to create a Jupyter notebook containing a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model needed to be based on the weighted definition of efficiency and needed an accuracy or AUC score above 70% to be considered a success. The following table summarizes whether the Jupyter notebook met these success criteria.

Criterion/Metric                                                                  | Required Data             | Success
Is the formula for calculating the efficiency appropriate and applicable?         | E = (B/(P/S))/X           | YES
Does the model have accuracy and AUC scores of more than 70%?                     | Accuracy 90.93% / AUC 96% | YES
Does the project list the most important features for high-performing districts?  | Bar graph                 | YES

The table above is a straightforward way to visualize and summarize the accomplishments of each objective.

The graph above shows the Receiver Operating Characteristic (ROC) curve, which visualizes how skilled the model is (blue line) compared to an unskilled model (chance, red line).

The graph below shows the most important features of the model and is a graphical way to convey how each variable affects the model based on its F score.

Based on the findings above, the author suggests that a study be conducted on how racial diversity in particular affects the efficiency of Texas public school districts, as it has been shown to influence the model. In addition, further study is needed to determine whether having a bilingual program correlates positively or negatively with performance.

Sources

Baron, E. J. (2021). School Spending and Student Outcomes: Evidence from Revenue Limit Elections in Wisconsin. American Economic Journal: Economic Policy, 14(1), 1-39. Retrieved from https://www.aeaweb.org/articles?id=10.1257/pol.20200226

Carhart, A. E. (2016). School Finance Decisions and Academic Performance: An Analysis of the Impacts of School Expenditures on Student Performance. Master's thesis. Retrieved from https://www.csus.edu/college/social-sciences-interdisciplinary-studies/public-policy-administration/_internal/_documents/thesis-bank/thesis-bank-2016-carhart.pdf

Texas Education Agency. (2022, July 31). PEIMS Financial Data Downloads. Retrieved from Texas Education Agency: https://tea.texas.gov/finance-and-grants/state-funding/state-funding-reports-and-data/peims-financial-data-downloads

Texas Education Agency. (2022, July 31). Texas Education Agency Data 2012-2019. Retrieved from Kaggle: https://www.kaggle.com/datasets/9e3ce42f60ded3ba2a6dd890993493f2c4b284c5cfa035d711bd98fa3359924c?resource=download

Thompson, J., Young, J. K., & Shelton, K. (2021). Evaluating Educational Efficiency in Texas Public Schools Utilizing Data Envelopment Analysis. School Leadership Review, 16(1), Article 9. Retrieved from https://scholarworks.sfasu.edu/slr/vol16/iss1/9

Using MapQuest API to Get Geo Data

A friendly tutorial on getting zip codes and other geographic data from street addresses.


Knowing how to deal with geographic data is a must-have for a data scientist. In this post, we will play around with the MapQuest Search API to get zip codes from street addresses along with their corresponding latitude and longitude to boot!

The Scenario

In 2019, my friends and I participated in CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on San Antonio’s map. The problem is, we had incomplete data. Surprise! All we had were a street number and a street name — no zip code, no latitude, nor longitude. We then turned to the great internet for some help.

We found a great API by MapQuest that will give us exactly what we needed. With just a sprinkle of Python code, we were able to accomplish our goal.

Today, we’re going to walk through this process.

The Data

To follow along, you can download the data from here. Just scroll down to the bottom and tab on over to the Data Catalog 2019. Look for SAWS (San Antonio Water System) as shown below.


Download the file by clicking on the link to the Excel file.


OR, you can click on this.

MapQuest API Key

Head on over to https://developer.mapquest.com/ and create an account to get a free API key.


Copy the ‘Consumer Key’ and keep it in a safe place. We’ll need it later.

Jupyter Notebook

Now, let’s fire up a Jupyter notebook and get coding!

For starters, let's set up the environment by doing a couple of imports.
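A minimal sketch of that setup cell might look like the following; the API_KEY value is only a placeholder for your own Consumer Key.

# a minimal sketch of the setup cell
import pandas as pd
import requests

# MapQuest 'Consumer Key' from the developer portal (replace with your own)
API_KEY = 'YOUR_MAPQUEST_CONSUMER_KEY'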

Don't forget to replace the API_KEY value above with your own key.

Now, let's read the Excel file with a simple df = pd.read_excel().


Next, we'll combine the street number and street name columns.
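A sketch of that step, assuming the SAWS file exposes separate street-number and street-name columns (the column names below are illustrative):

# combine street number and street name into one address column
# (column names are illustrative; adjust them to match the SAWS file)
df['street_address'] = df['street_number'].astype(str) + ' ' + df['street_name']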


The ALL CAPS hurts my eyes. Let’s do something about it:

df['street_address'] = df.street_address.str.title()

Below are two functions that call the API and return geo data.
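Here is a sketch of what such functions could look like, assuming the standard shape of the MapQuest Geocoding API's JSON response; get_zip matches the name used below, while get_lat_lng is an illustrative companion.

import requests

def get_zip(url):
    """Call the geocoding endpoint and return the first postal code found."""
    response = requests.get(url)
    response.raise_for_status()
    location = response.json()['results'][0]['locations'][0]
    return location.get('postalCode')

def get_lat_lng(url):
    """Call the geocoding endpoint and return the first (lat, lng) pair found."""
    response = requests.get(url)
    response.raise_for_status()
    lat_lng = response.json()['results'][0]['locations'][0]['latLng']
    return lat_lng['lat'], lat_lng['lng']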

We can manually call it with the line below. Don’t forget to replace the ‘#####’ with your own API key. You can use any address you want (replace spaces with a + character).

get_zip('https://www.mapquestapi.com/geocoding/v1/address?key=####################&inFormat=kvp&outFormat=json&location=100+Military+Plaza&thumbMaps=false&delimiter=%2C')

But we've got many addresses, so we'll use a loop to call the API repeatedly.
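A sketch of that loop, with a hypothetical make_url() helper to build each request URL:

# hypothetical helper that builds the geocoding request URL for one address
def make_url(street_address, api_key):
    base = 'https://www.mapquestapi.com/geocoding/v1/address'
    location = street_address.replace(' ', '+')
    return (f'{base}?key={api_key}&inFormat=kvp&outFormat=json'
            f'&location={location}&thumbMaps=false&delimiter=%2C')

zip_codes, lats, lngs = [], [], []
for address in df['street_address']:
    url = make_url(address, API_KEY)
    zip_codes.append(get_zip(url))
    lat, lng = get_lat_lng(url)   # two calls per address keeps the sketch simple
    lats.append(lat)
    lngs.append(lng)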

Let’s see what the result looks like:


Finally, let's create a dataframe that will house the street addresses, complete with zip code, latitude, and longitude.
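A sketch of that final step, reusing the lists collected in the loop above:

# assemble the geocoded results alongside the original addresses
geo_df = pd.DataFrame({
    'street_address': df['street_address'],
    'zip_code': zip_codes,
    'latitude': lats,
    'longitude': lngs,
})
geo_df.head()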

Voila! We’ve got ourselves geo data.


For extra credit, let's import the data into Tableau and get a pretty spiffy visual:


And that’s it, folks!

You can find the jupyter notebook here.

Thanks for stopping by and reading my post. Hope it was useful 🙂

If you want to learn more about my journey from slacker to data scientist, check out From Slacker to Data Scientist: My journey into data science without a degree, on Towards Data Science.

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: The Slacker's Guide to Rebranding Yourself as a Data Scientist (opinionated advice for the rest of us; love of math optional), also on Towards Data Science.

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science, a Medium publication.

Get Your Feet Wet in Power BI

A hands-on introduction to Microsoft's analytics tool


As a data scientist, you’ll need to learn to be comfortable with analytics tools sooner or later. In today’s post, we will dive headfirst and learn the very basics of Power BI.

Be sure to click on the images to better see some details.

The Data

The dataset that we will be using for today’s hands-on tutorial can be found at https://www.kaggle.com/c/instacart-market-basket-analysis/data. This dataset is “a relational set of files describing customers’ orders over time.” Download the zip files and extract them to a folder on your local hard drive.

Download Power BI Desktop

If you haven’t already, go to https://powerbi.microsoft.com/desktop and click on the “Download free” button.


If you’re using Windows 10, it will ask you to open Microsoft Store.


Go ahead and click on the “Install” button.


And let’s get started by clicking on the “Launch” button.

A Thousand Clicks


Click on “Get data” when the splash screen appears.


You will be presented with a lot of file formats and sources; let's choose "Text/CSV" and click on the "Connect" button.


Select “order_products_prior.csv” and click on the “Open” button.


The image below shows what the data looks like. Click on the “Load” button to load the dataset into Power BI Desktop.


Load the rest of the dataset by selecting “Get Data” and choosing the “Text/CSV” option on the dropdown.


You should have these three files loaded into Power BI Desktop:

  • order_products_prior.csv
  • orders.csv
  • products.csv

You should see the following tables appear on the “Fields” panel of Power BI Desktop, as shown below. (Note: the image shows Power BI in Report View.)


Let’s see what the Data View looks like by clicking on the second icon on the left side of Power BI Desktop.


And now, let’s check out the Model View where we will see how the different tables are related to each other.


If we hover a line, it will turn yellow and the corresponding related fields are both highlighted as well.


In this case, Power BI Desktop was smart enough to infer the two relationships. However, most of the time we will have to create the relationships ourselves. We will cover this topic in the future.


Let’s go back to the Report View and examine the “Visualizations” panel closely. Look for the “slicer” icon which looks like a square with a funnel at the bottom right corner. Click on it to add a visual to the report.


In the “Fields” panel, find the “department_id” and click the checkbox on its left.


This will cause the “department_id” field to appear under the “Visualizations” panel in the “Field” box.

Next, take your mouse cursor and hover over the top right corner of the visual in the Report View. Click on the three dots that appeared in the corner as shown below.


Click on “List” in the dropdown that appeared.


While the “department_id” visual is selected, you should see corner marks indicating the visual as the active visual. While the “department_id” is active, press CTRL+C to copy it and then CTRL+V to paste it. Move the new visual to the right of the original visual.

Make the second visual active by clicking somewhere inside it. Then look for the “aisle_id” field in the “Fields” panel on the right of Power BI Desktop as shown below.


Try selecting a value on the “department_id” visual and observe how the selection on “aisle_id” changes accordingly.


Now, examine the “Visualizations” panel again and click on the table visual as shown below.


In the “Fields” panel, select “product_id” and “product_name” or drag them in the “Values” box.


Power BI Desktop should look similar to the image below.


This time, try selecting a value from both “department_id” and “aisle_id” — observe what happens to the table visual on the right.


Let’s create another visual by copying and pasting the table visual. This time, select (or drag) the following fields to the “Values” box of the visual.

  • order_id
  • user_id
  • order_number
  • order_hour_of_day
  • order_dow
  • days_since_prior_order

Power BI Desktop should now look similar to the image below.


Try clicking one of the selections in the table visual (where it’s showing “product_id” and “product_name”) and observe how the table on the right changes accordingly.


For a closer look, activate Focus Mode by clicking on the icon as shown below.


The table displays the details of orders that have the product that you selected in the table with “product_id” and “product_name.”

Get out of Focus Mode by clicking on “Back to report” as shown below.


Let’s rename this page or tab by right-clicking on the page name (“Page 1”) and selecting “Rename Page.”


Type in “PRODUCTS” and press ENTER.


Let's add another page or tab to the report by right-clicking on the page name again ("PRODUCTS") and selecting "Duplicate Page."


Rename the new page “TRANSACTIONS” and delete (or remove) the right-most table with order details on it.

Change the top-left visual and update its fields as shown below. The "Fields" box should say "order_dow" while the top-left visual is active.

Move the visuals around so the layout looks similar to the image below.


Do the same thing for the next visual. This time, select "order_hour_of_day," and your Power BI Desktop should look like the image below.


Do the same thing one last time for the last table and it should now contain fields as shown below.


Let’s add another page or tab to the report by clicking on the “+” icon at the bottom of the report’s main work area.


Basic Exploration

In the “Visualizations” panel, select “Stacked column chart.”


Resize the chart by dragging its move handles.

Make sure the "Axis" box contains "order_dow" and the "Values" box contains "order_id." Power BI Desktop should automatically calculate the count for "order_id" and display the field as "Count of order_id" as shown below.


The graph above is interesting because it shows a higher number of orders for Day 0 and Day 1.

Let’s make another chart.

We will follow the same procedure of adding a chart, and this time we'll use "order_hour_of_day" in the "Axis" box as shown below.


The graph shows the peak time for the number of orders.

One last graph!

We will add another chart with “days_since_prior_order” in the “Axis” box.


This last graph is the most interesting because the number of reorders peaks during these three time periods: 7 days, 14 days, and 30 days since prior order. This means that people are in a habit of resupplying every week, every two weeks, and every month.

That’s it, folks!

In the next article, we will "prettify" our charts and make them more readable to others.

The procedures above may have felt drawn out. But if you're a novice Power BI user, don't despair! With regular practice, the concepts demonstrated in this article will soon become second nature, and you'll probably be able to do them in your sleep.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out From Slacker to Data Scientist: My journey into data science without a degree, on Towards Data Science.

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: The Slacker's Guide to Rebranding Yourself as a Data Scientist (opinionated advice for the rest of us; love of math optional), also on Towards Data Science.

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science, a Medium publication.

Forecasting in Power BI

A visual step-by-step guide to forecasting using Power BI.


In this post, we’ll go through the process of creating forecasting in Power BI.

Get the Data

You can download the dataset that I used here. It contains daily female births in California in 1959¹. For a list of other time-series datasets, check out Jason Brownlee’s article 7 Time Series Datasets for Machine Learning – Machine Learning Mastery.

Let’s load the data into Power BI. Open up Power BI and click on “Get data” on the welcome screen as shown below.


Next, you’ll be presented with another pane that asks what type of data we want to get. Select “Text/CSV” as shown below and click on “Connect.”


When the File Open window appears, navigate to where we saved the dataset and click on the “Open” button on the lower right-hand corner.


When a preview appears, just click on “Load.”


We’ll now see the main working area of Power BI. Head over to the “Visualizations” panel and look for “Line Chart.”


This is what the line chart icon looks like:


Next, a visual placeholder will appear. Grab the hot corner marking on the lower right-hand corner of the placeholder and drag it diagonally down to the right corner of the main working area.


Next, head over to the "Fields" panel.


With the line chart placeholder still selected, find the “Date” field and click on the square box to put a checkmark on it.


We’ll now see the “Date” field under Axis. Click on the down arrow on the right of the “Date” as shown below.


Select “Date” instead of the default, “Date Hierarchy.”


Then, let’s put a checkmark on the “Births” field.


We'll now see a line graph like the one below. Head over to the Visualizations panel and, under the list of icons, find the analytics icon as shown below.


Scroll down the panel and find the “Forecast” section. Click on the down arrow to expand it if necessary.


Next, click on “+Add” to add forecasting on the current visualization.


We’ll now see a solid gray fill area and a line plot to the right of the visualization like the one below.


Let's change the Forecast length to 31 points. In this case, a data point equals a day, so 31 points roughly equates to a month's worth of predictions. Click on "Apply" in the lower right corner of the Forecast group to apply the changes.


Instead of points, let's change the unit of measure to "Months," as shown below.


Once we click "Apply," we'll see the changes in the visualization. The graph below shows a forecast for 3 months.


What if we wanted to compare how the forecast compares to actual data? We can do this with the “Ignore last” setting.

For this example, let's ignore the last 3 months of the data. Power BI will then forecast 3 months' worth of data using the dataset while ignoring the last 3 months. This way, we can compare Power BI's forecasting result with the actual data in the last 3 months of the dataset.

Let's click on "Apply" when we're done changing the settings as shown below.


Below, we can see how the Power BI forecast compares with the actual data. The solid black line represents the forecast, while the blue line represents the actual data.


The solid gray fill on the forecast represents the confidence interval. The higher its value, the larger the area will be. Let's lower our confidence interval to 75% as shown below and see how it affects the graph.


The solid gray fill became smaller as shown below.


Next, let's take seasonality into account. Below, we set it to 90 points, which is equivalent to about 3 months. This value tells Power BI to look for seasonality within a 3-month cycle. Play around with this value to find what makes sense for your data.


The result is shown below.


Let’s return our confidence interval to the default value of 95% and scroll down the group to see formatting options.


Let’s change the forecasting line to an orange color and let’s make the gray fill disappear by changing formatting to “None.”


And that’s it! With a few simple clicks of the mouse, we got ourselves a forecast from the dataset.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out From Slacker to Data Scientist: My journey into data science without a degree, on Towards Data Science.

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: The Slacker's Guide to Rebranding Yourself as a Data Scientist (opinionated advice for the rest of us; love of math optional), also on Towards Data Science.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Machine Learning Mastery. (June 21, 2020). 7 Time Series Datasets for Machine Learning. https://machinelearningmastery.com/time-series-datasets-for-machine-learning/

This article was first published in Towards Data Science, a Medium publication.

Democratize Data Science

Every once in a while, I come across an article that decries online data science courses and boot camps as pathways toward getting a data science job. Most of these articles aim not to discourage but to serve as a reminder to take a hard look in the mirror first and realize what we're up against. However, a few detractors have proclaimed that the proliferation of these online courses and boot camps has caused the degradation of the profession.

To the latter, I vehemently disagree.

Bridging the Skill Gap

Data science has captured the popular imagination ever since Harvard Business Review dubbed data scientist the sexiest job of the 21st century. More than seven years later, data science remains one of the most highly sought-after job markets today. In fact, due to the dynamics of supply and demand, "the United States alone is projected to face a shortfall of some 250,000 data scientists by 2024¹."

As a result, capitalism and entrepreneurship answered the call and companies like Codeup have vowed to “help bridge the gap between companies and people wanting to enter the field.”²

In addition, AutoML libraries like PyCaret are “democratizing machine learning and the use of advanced analytics by providing free, open-source, and low-code machine learning solution for business analysts, domain experts, citizen data scientists, and experienced data scientists”³.

The availability of online courses, boot camps, and AutoML libraries has led a lot of data scientists to raise their brows. They fear that boot camp alumni and self-taught candidates would somehow lower the overall caliber of data scientists and disgrace the field. Furthermore, they are afraid that the availability of tools like AutoML would allow anyone to be a data scientist.

I mean, God forbid if anyone thinks that they too can be data scientists! Right?

Wrong.

The Street Smart Data Scientist

Alumni of boot camps and self-taught learners, like myself, have one thing going for them: our rookie smarts. To quote Liz Wiseman, author of the book Rookie Smarts:

In a rapidly changing world, experience can be a curse. Being new, naïve, and even clueless can be an asset. — Liz Wiseman

Rookies are unencumbered. We are alert and constantly seeking like hunter-gatherers, cautious but quick like firewalkers, and hungry and relentless like frontiersmen⁴. In other words, we’re street smart.

Many are so bogged down by "you've got to learn this" and "you've got to learn that" that they forget to stress the fact that data science is so vast you can't possibly know everything. And that's okay.

We learn fast and adapt quickly.

At the end of the day, it’s all about the value that we bring to our organizations. They are, after all, the ones paying our bills. We don’t get paid to memorize formulas or by knowing how to code an algorithm from scratch.

We get paid to solve problems.

And this is where the street smart data scientist excels. We don't suffer from analysis paralysis or get bogged down in theory, at least not while on the clock. Our focus is on pragmatic solutions to problems, not on academic debate.

This is not to say we’re not interested in the latest research. In fact, it’s quite the contrary. We are voracious consumers of the latest development in machine learning and AI. We drool over the latest development in natural language processing. And we’re always on the lookout for the latest tool that will make our jobs easier and less boring.

And AutoML

So what if we have to use AutoML? If it gets us to an automatic pipeline where analysts can get the results of machine learning without manual intervention by a data scientist, all the better. We're not threatened by automation; we're exhilarated by it!

Do not let perfection be the enemy of progress. — Winston Churchill

In building an automatic pipeline, there are bound to be some tradeoffs. But building it this way will free our brain cells and give us more time to focus on solving other higher-level problems and producing more impactful solutions.

We’re not concerned about job security, because we know that it doesn’t exist. What we do know is that the more value we bring to a business, the better we will be in the long run.

Maybe They’re Right?

After all this, I will concede a bit. For the sake of argument, maybe they’re right. Maybe online courses, boot camps, and low-code machine learning libraries really do produce low-caliber data scientists.

Big maybe.

But still, I argue, this doesn’t mean we don’t have value. Data science skills lie on a spectrum and so does companies’ maturity when it comes to data. Why hire a six-figure employee when your organization barely has a recognizable machine learning infrastructure?

Again, maybe.

The Unicorn

Maybe, to be labeled as a data scientist, one must be a unicorn first. A unicorn data scientist is a data scientist who excels at all facets of data science.


Data science has long been described as the intersection of computer science, applied statistics, and business or domain knowledge. To this, they ask: how can one person possibly accumulate all that knowledge in just a few months? To this, we ask the same question: how can a college grad?

Unicorns do exist, I believe, but they too had to start somewhere.

So why can’t we?

Conclusion

A whole slew of online courses and tools promise to democratize data science, and this is a good thing.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out the article From Slacker to Data Scientist: My journey into data science without a degree.

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: The Slacker's Guide to Rebranding Yourself as a Data Scientist (opinionated advice for the rest of us; love of math optional).

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Harvard Business Review. (June 3, 2020). Democratizing Data Science in Your Organization. https://hbr.org/sponsored/2019/04/democratizing-data-science-in-your-organization

[2] San Antonio Express-News. (June 3, 2020). Software development bootcamp Codeup launching new data science program. https://www.mysanantonio.com/business/technology/article/Software-development-bootcamp-Codeup-launching-13271597.php

[3] Towards Data Science. (June 4, 2020). Machine Learning in Power BI Using PyCaret. https://towardsdatascience.com/machine-learning-in-power-bi-using-pycaret-34307f09394a

[4] The Wiseman Group. (June 4, 2020). Rookie Smarts: Why Learning Beats Knowing in the New Game of Work. https://thewisemangroup.com/books/rookie-smarts/

This article was first published in Towards Data Science, a Medium publication.

Programming Environment Setup

I was bored over the weekend, so I decided to restore my MacBook Pro to factory settings so that I could set up my programming environment the proper way. After all, what's a data scientist without her toys?

Let's start with a replacement for the default terminal and a pyenv installation to manage different Python versions.

Let’s move on to managing different Python interpreters and virtual environments using pyenv-virtualenv.

Topic Modeling on PyCaret

I remember a brief conversation with my boss' boss a while back. He said that he wouldn't be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, "Guess what? There's an API for that." He then went on about the futility of doing something that's already been done instead of just using it.

This gave me an insight into how an executive thinks. It's not that they don't care about the coolness factor of a project, but at the end of the day, they're most concerned about how a project will add value to the business and, even more importantly, how quickly it can be done.

In the real world, the time it takes to build a prototype matters. And the quicker we get from data to insights, the better off we will be. Both help us stay agile.

And this brings me to PyCaret.


PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.[1]

PyCaret is basically a wrapper around some of the most popular machine learning libraries and frameworks, such as scikit-learn and spaCy. Here are the things that PyCaret can do:

  • Classification
  • Regression
  • Clustering
  • Anomaly Detection
  • Natural Language Processing
  • Association Rule Mining

If you’re interested in reading about the difference between traditional NLP approach vs. PyCaret’s NLP module, check out Prateek Baghel’s article.

Natural Language Processing

In just a few lines of code, PyCaret makes natural language processing so easy that it's almost criminal. Like most of its other modules, PyCaret's NLP module offers a streamlined pipeline that cuts the time from data to insights by more than half.

For example, with only one line, it performs text processing automatically, with the ability to customize stop words. Add another line or two, and you got yourself a language model. With yet another line, it gives you a properly formatted plotly graph. And finally, adding another line gives you the option to evaluate the model. You can even tune the model with, guess what, one line of code!

Instead of just telling you all about the wonderful features of PyCaret, maybe it'd be better if we do a little show and tell instead.


The Pipeline

For this post, we’ll create an NLP pipeline that involves the following 6 glorious steps:

  1. Getting the Data
  2. Setting up the Environment
  3. Creating the Model
  4. Assigning the Model
  5. Plotting the Model
  6. Evaluating the Model

We will be going through an end-to-end demonstration of this pipeline with a brief explanation of the functions involved and their parameters.

Let’s get started.


Housekeeping

Let us begin by installing PyCaret. If this is your first time installing it, just type the following into your terminal:

pip install pycaret

However, if you have a previously installed version of PyCaret, you can upgrade using the following command:

pip install --upgrade pycaret

Beware: PyCaret is a big library so it’s going to take a few minutes to download and install.

We’ll also need to download the English language model because it is not included in the PyCaret installation.

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Next, let’s fire up a Jupyter notebook and import PyCaret’s NLP module:

#import nlp module
from pycaret.nlp import *

Importing the pycaret.nlp automatically sets up your environment to perform NLP tasks only.

Getting the Data

Before setup, we need to decide first how we’re going to ingest data. There are two methods of getting the data into the pipeline. One is by using a Pandas dataframe and another is by using a simple list of textual data.

Passing a DataFrame

#import pandas if we're gonna use a dataframe
import pandas as pd

# load the data into a dataframe
df = pd.read_csv('hilaryclinton.csv')

Above, we’re simply loading the data into a dataframe.

Passing a List

# read a file containing a list of text data and assign it to 'lines'
with open('list.txt') as f:
    lines = f.read().splitlines()

Above, we're opening the file 'list.txt' and reading it. We assign the resulting list to the variable lines.

Sampling

For the rest of this experiment, we'll just use a dataframe to pass textual data to the setup() function of the NLP module. And for the sake of expediency, we'll sample the dataframe to select only a thousand tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)

Let’s take a quick look at our dataframe with df.head() and df.shape.

Setting Up the Environment

In the line below, we’ll initialize the setup by calling the setup() function and assign it to nlp.

# initialize the setup
nlp = setup(data = df, target = 'text', session_id = 493, custom_stopwords = [ 'rt', 'https', 'http', 'co', 'amp'])

With data and target, we’re telling PyCaret that we’d like to use the values in the 'text' column of df. Also, we’re setting the session_id to an arbitrary number of 493 so that we can reproduce the experiment over and over again and get the same result. Finally, we added custom_stopwords so that PyCaret will exclude the specified list of words in the analysis.

Note that if we wanted to use a list instead, we could replace df with lines and get rid of target = 'text', because a list has no columns for PyCaret to target!

Here’s the output of nlp:

The output table above confirms our session id, the number of documents (rows or records), and the vocabulary size. It also shows whether or not we used custom stopwords.

Creating the Model

Below, we'll create the model by calling the create_model() function and assigning it to lda. The function already knows to use the dataset that we specified during setup(). In our case, PyCaret knows we want to create a model based on the 'text' column of df.

# create the model
lda = create_model('lda', num_topics = 6, multi_core = True)

In the line above, notice that we passed 'lda' as the first parameter. LDA stands for Latent Dirichlet Allocation. We could've just as easily opted for other types of models.

Here’s the list of models that PyCaret currently supports:

  • ‘lda’: Latent Dirichlet Allocation
  • ‘lsi’: Latent Semantic Indexing
  • ‘hdp’: Hierarchical Dirichlet Process
  • ‘rp’: Random Projections
  • ‘nmf’: Non-Negative Matrix Factorization

I encourage you to research the differences between the models above. To start, check out Lettier's awesome guide on LDA.

The next parameter we used is num_topics = 6. This tells PyCaret to use six topics in the results, numbered 0 to 5. If num_topics is not set, the default number is 4. Lastly, we set multi_core to tell PyCaret to use all available CPUs for parallel processing. This saves a lot of computational time.

Assigning the Model

By calling assign_model(), we’re going to label our data so that we’ll get a dataframe (based on our original dataframe: df) with additional columns that include the following information:

  • Topic percent value for each topic
  • The dominant topic
  • The percent value of the dominant topic

# label the data using trained model
df_lda = assign_model(lda)

Let’s take a look at df_lda.

Plotting the Model

Calling the plot_model() function will give us some visualizations of frequency, distribution, polarity, et cetera. The plot_model() function takes three parameters: model, plot, and topic_num. The model parameter tells PyCaret which model to use and must come from a previous create_model() call. topic_num designates which topic number (from 0 to 5) the visualization will be based on.

plot_model(lda, plot='topic_distribution')
plot_model(lda, plot='topic_model')
plot_model(lda, plot='wordcloud', topic_num = 'Topic 5')
plot_model(lda, plot='frequency', topic_num = 'Topic 5')
plot_model(lda, plot='bigram', topic_num = 'Topic 5')
plot_model(lda, plot='trigram', topic_num = 'Topic 5')
plot_model(lda, plot='distribution', topic_num = 'Topic 5')
plot_model(lda, plot='sentiment', topic_num = 'Topic 5')
plot_model(lda, plot='tsne')

PyCaret offers a variety of plots. The type of graph generated depends on the plot parameter. Here is the list of currently available visualizations:

  • ‘frequency’: Word Token Frequency (default)
  • ‘distribution’: Word Distribution Plot
  • ‘bigram’: Bigram Frequency Plot
  • ‘trigram’: Trigram Frequency Plot
  • ‘sentiment’: Sentiment Polarity Plot
  • ‘pos’: Part of Speech Frequency
  • ‘tsne’: t-SNE (3d) Dimension Plot
  • ‘topic_model’ : Topic Model (pyLDAvis)
  • ‘topic_distribution’ : Topic Infer Distribution
  • ‘wordcloud’: Word cloud
  • ‘umap’: UMAP Dimensionality Plot

Evaluating the Model

Evaluating the model involves calling the evaluate_model() function. It takes only one parameter: the model to be used. In our case, the model is stored in lda, which was created with the create_model() function in an earlier step.

The function returns a visual user interface for plotting.
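In code, that is a single call on the lda model we created earlier:

# launch the interactive evaluation widget for the trained topic model
evaluate_model(lda)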

And voilà, we’re done!

Conclusion

Using PyCaret's NLP module, we were able to go quickly from getting the data to evaluating the model in just a few lines of code. We covered the functions involved in each step and examined the parameters of those functions.


Thank you for reading! PyCaret’s NLP module has a lot more features and I encourage you to read their documentation to further familiarize yourself with the module and maybe even the whole library!

In the next post, I’ll continue to explore PyCaret’s functionalities.

If you want to learn more about my journey from slacker to data scientist, check out the article here.

Stay tuned!

You can reach me on Twitter or LinkedIn.


[1] PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/