I was going to be a Data Science Librarian!

Author

Breanna E. Green

Published

October 21, 2022

So, there comes a time in every young PhD student’s life when they say, “Nah… this ain’t it.” You’ll hear about a ton of people who decide the doctoral program isn’t for them. This could be for a slew of reasons, including (but not limited to): they’ve lost their passion, burned out, found new interests, or gotten really tired of being broke and tired. I experienced what I like to call my PhD Quarter Life Crisis: that weird time between your second and third year when you may feel aimless and simply inadequate. I couldn’t tell if I was doing enough. My progress felt (and continues to feel) very slow. I desperately want to publish something, ANYTHING. But it’s hard to get a project to that point.

So I started applying for jobs. In particular, I looked for roles that 1) would bring me back home to Texas, closer to friends & family, 2) looked interesting and where I knew I would excel, and 3) would keep me in academia (bonus: a salary doesn’t hurt).

Thus, I present to you a mini-project I did to look up academic roles at public Texas institutions of higher education.

Note: All data used here is publicly available by Texas mandate. The file can be found here.

Other interesting data sources:

- https://texascollegesalaries.com/institutions
- https://salaries.texastribune.org/departments/library-and-archives-commission/
- https://govsalaries.com/state/TX
Code
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', None)
Code
### Again, I was looking at a Data Science Librarian Role (however, at a non-public university!)

### read in data
uni_ = pd.read_csv('TCS_All_Data_4-28-22.csv', thousands=',')

### update hire date to datetime
uni_['hire_date_dt'] = pd.to_datetime(uni_['hire_date'], errors='coerce')
uni_['race'] = uni_.race.str.lower()

### add 2 "inflation" columns
### universities (in Texas at least) often increase wages to keep up with inflation
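### I wanted to project a few years forward: salary2 is the current salary as a number,
### and salary3 and salary4 apply a 2.8% raise to it once and twice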
uni_['salary2'] = pd.to_numeric(uni_['salary'])
uni_['salary3'] = (uni_['salary2']*.028)+(uni_['salary2'])
uni_['salary4'] = (uni_['salary3']*.028)+(uni_['salary3'])
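A quick aside on the arithmetic: applying the same 2.8% raise twice is compound growth, so the salary after n raises is salary × 1.028^n. Here is a minimal sketch of that, assuming a flat 2.8% annual raise (the rate used in the cell above). The helper name `project_salary` is hypothetical and not part of the original notebook, and the three salaries are the first three UT Austin rows in the data.

```python
import pandas as pd

def project_salary(salary, rate=0.028, years=1):
    """Compound a salary (number or Series) forward by `years` at a flat annual `rate`."""
    return salary * (1 + rate) ** years

# Salaries taken from the first three rows of the dataset
salaries = pd.Series([88000.0, 64711.0, 251500.0])

projected = pd.DataFrame({
    'salary2': salaries,                            # current salary
    'salary3': project_salary(salaries, years=1),   # one raise
    'salary4': project_salary(salaries, years=2),   # two raises
})
print(projected.round(2))
```

Writing it as a function makes it easy to try a different rate or look further out, for example `project_salary(salaries, years=3)`.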
Code
### look at some data
display(uni_.shape, uni_.head())

### What schools do we have?
display(sorted(uni_.agency.unique()))
(80009, 15)
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
0 UT Austin Accounting Soren Aandahl Lecturer Part-time white Male 1/16/2021 88000.0 192000 3/31/2022 2021-01-16 88000.0 90464.000 92996.992000
1 UT Austin Marketing Christopher Aarons Assistant Professor of Instruction Full-time white Male 9/1/2019 64711.0 192001 3/31/2022 2019-09-01 64711.0 66522.908 68385.549424
2 UT Austin Computer Science Scott J Aaronson Professor Full-time white Male 4/28/2016 251500.0 192002 3/31/2022 2016-04-28 251500.0 258542.000 265781.176000
3 UT Austin UTeach-Natural Sciences Vivian Abagiu Communications Coordinator Full-time hispanic or latino Female 10/15/2015 72000.0 192003 3/31/2022 2015-10-15 72000.0 74016.000 76088.448000
4 UT Austin University of Texas Elementary School Anna Marie Abalos Food Preparation/Service Worker Part-time hispanic or latino Female 8/19/2021 31200.0 192004 3/31/2022 2021-08-19 31200.0 32073.600 32971.660800
['Austin Community College',
 'Lamar University',
 'Lone Star Community College',
 'Sam Houston State University',
 'Sul Ross',
 'TAMU Commerce',
 'TAMU Prairie View',
 'Texas A&M',
 'Texas A&M Health Science Center',
 'Texas State University',
 'Texas Tech',
 "Texas Woman's University",
 'UNT',
 'UT Arlington',
 'UT Austin',
 'UT Dallas',
 'UT Permian Basin',
 'UT San Antonio',
 'UT Tyler',
 'University of Houston',
 'University of Houston System']

We don’t have all the TX schools, but this is great! Let’s dig around.

Code
### Looking for 'lib'rary roles (without much discrimination) and Full-Time
FT_lib = uni_[uni_.job_title.str.contains('Lib') & uni_.employment_time.str.contains('Full')].copy(deep=True)
FT_lib.sort_values(by=['race', 'gender','salary',], inplace=True)
display(FT_lib.shape, FT_lib.head())
(804, 15)
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
74223 Sam Houston State University Newton Gresham Library Sarah I. Sellers Library Associate Full-Time american indian or alaskan native Female 11/1/2017 35040.00 184691 3/28/2022 2017-11-01 35040.00 36021.12000 37029.711360
406 Texas State University University Libraries Karen E Cowen Library Assistant IV Full-time american indian or alaskan native Female 8/21/2000 49128.60 74505 8/20/2021 2000-08-21 49128.60 50504.20080 51918.318422
59769 Texas Woman's University Library Julie Reed Sullivan Librarian Digital Content Full-Time american indian or alaskan native Female 2/22/1993 61055.00 211345 3/22/2022 1993-02-22 61055.00 62764.54000 64521.947120
418 Texas State University University Libraries Elizabeth Karen Cruces Librarian Full-time american indian or alaskan native Female 3/1/2021 62499.96 75561 8/20/2021 2021-03-01 62499.96 64249.95888 66048.957729
74696 Sam Houston State University Newton Gresham Library Akira Y. Wu Library Assistant II Full-Time asian Female 12/1/2020 33360.00 185163 3/28/2022 2020-12-01 33360.00 34294.08000 35254.314240
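One thing the output above shows is that `employment_time` is not spelled consistently ('Full-time' vs. 'Full-Time'), and job titles could vary in case too. Below is a small sketch of a more forgiving filter, assuming you want a case-insensitive keyword match and an exact full-time check. The function `full_time_roles` is a hypothetical helper, and the sample rows are copied from tables in this post.

```python
import pandas as pd

# A few rows copied from the tables in this post
sample = pd.DataFrame({
    'job_title': ['Library Associate', 'Librarian', 'Lecturer', 'Sr Database Administrator'],
    'employment_time': ['Full-Time', 'Full-time', 'Part-time', 'Not Provided'],
})

def full_time_roles(df, keyword):
    """Rows whose title contains `keyword` (any case) and whose status is full-time."""
    # regex=False treats the keyword literally; na=False counts missing titles as non-matches
    title_match = df['job_title'].str.contains(keyword, case=False, na=False, regex=False)
    # lower-case before comparing so 'Full-time' and 'Full-Time' are treated alike
    is_full_time = df['employment_time'].str.lower().eq('full-time')
    return df[title_match & is_full_time]

lib_roles = full_time_roles(sample, 'lib')
print(lib_roles)
```

The trade-off is that a case-insensitive 'lib' can also pick up unrelated titles that happen to contain those letters, so it is worth eyeballing the matches.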
Code
### Let's also check out data related roles, full-time
FT_data = uni_[uni_.job_title.str.contains('Data') & uni_.employment_time.str.contains('Full')].copy(deep=True)
FT_data.sort_values(by=['race', 'gender','salary',], inplace=True)
display(FT_data.shape, FT_data.head())
(195, 15)
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
74160 Sam Houston State University IT Infrastructure and Support Victorria A. Saldana Data Center Operations Spec I Full-Time asian Female 11/1/2021 41616.00 184628 3/28/2022 2021-11-01 41616.00 42781.24800 43979.122944
1122 Texas A&M Dean Xiaoping Li Data Analyst Full-Time asian Female 7/2/2018 48502.44 155323 3/1/2022 2018-07-02 48502.44 49860.50832 51256.602553
1181 Texas A&M Tamu Libraries Ethelyn V Mejia Data Analyst Full-Time asian Female 6/1/2005 51188.16 156457 3/1/2022 2005-06-01 51188.16 52621.42848 54094.828477
57665 UT Austin Governmental Affairs and Initiatives Susan Yuanyuan Whitman Database Coordinator Full-time asian Female 9/2/2008 55000.00 208164 3/31/2022 2008-09-02 55000.00 56540.00000 58123.120000
572 Texas A&M Office Of Admissions Liu Shi Senior Data Analyst Full-Time asian Female 9/1/2008 60009.96 147134 3/1/2022 2008-09-01 60009.96 61690.23888 63417.565569
Code
FT_data.sort_values(by=['salary']).head()
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
58445 Texas Woman's University Library Dolores Aguilar Data Entry Oper II Full-Time hispanic or latino Female 12/1/2006 24046.00 210021 3/22/2022 2006-12-01 24046.00 24719.28800 25411.428064
29916 Sul Ross Student Support Services Department Lisa Griffith Data Tracking Admin Specialist Full-Time white Female 9/25/2018 24293.00 62890 8/13/2020 2018-09-25 24293.00 24973.20400 25672.453712
30047 Sul Ross Trio Talent Search Stephanie Weintraut Talent Search Data Spec/Sec Full-Time white Female 3/16/2020 24341.00 63214 8/13/2020 2020-03-16 24341.00 25022.54800 25723.179344
5369 UT Tyler Admissions Brittany Johnson Admissions Data Spec I Full-Time black or african american Female 5/16/2013 30514.32 27456 7/29/2020 2013-05-16 30514.32 31368.72096 32247.045147
19931 Texas State University Materials Mgmt & Logistics Jennifer Ann Mireles Data Entry Operator Full-time white Female 5/10/2010 31985.76 75435 8/20/2021 2010-05-10 31985.76 32881.36128 33802.039396
Code
### Community Colleges (with community in the name) all roles

CC_ = uni_[uni_.agency.str.contains('Community')]
display(CC_.shape, CC_.head())
(5223, 15)
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
174 Austin Community College Marketing Nicholas Sarantakes Professor Full-time white Male 8/24/1981 142052.00 189000 3/29/2022 1981-08-24 142052.00 146029.45600 150118.280768
8323 Austin Community College Physics Paul Edward Williams Professor Full-time white Male 8/16/2004 88776.00 189086 3/29/2022 2004-08-16 88776.00 91261.72800 93817.056384
11729 Austin Community College Biology Felix S Villarreal Professor Full-time hispanic or latino Male 9/1/1992 89047.00 189087 3/29/2022 1992-09-01 89047.00 91540.31600 94103.444848
12866 Austin Community College Biology Sarah L Strong Professor Full-time white Female 8/23/1993 118117.00 189261 3/29/2022 1993-08-23 118117.00 121424.27600 124824.155728
74587 Lone Star Community College Cisco Donna Ivey Dir CISCO Prog Full-Time white Female 7/1/2010 102625.94 186427 3/24/2022 2010-07-01 102625.94 105499.46632 108453.451377
Code
### What types of Assistant roles make between 55k-75k at these community colleges?

display(CC_[CC_.job_title.str.contains('Assistant') & (CC_.salary2.between(55000,75000))].sort_values(by=['salary'], ascending=False).head())
agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
79903 Austin Community College Sonography Joel Thurman Professor, Assistant Full-time white Male 8/9/2021 73972.0 191346 3/29/2022 2021-08-09 73972.0 76043.216 78172.426048
79718 Austin Community College Library Services Jorge Lopez-McKnight Professor, Assistant Full-time hispanic or latino Male 10/14/2018 73972.0 191161 3/29/2022 2018-10-14 73972.0 76043.216 78172.426048
77959 Austin Community College Sonography Sherri Lynn Professor, Assistant Full-time white Female 8/19/2019 73972.0 189402 3/29/2022 2019-08-19 73972.0 76043.216 78172.426048
79631 Austin Community College Library Services Christina M McCourt Professor, Assistant Full-time hispanic or latino Female 10/2/2017 72749.0 191074 3/29/2022 2017-10-02 72749.0 74785.972 76879.979216
78779 Austin Community College Emergency Med Svcs Professions Neia D Hoffman Professor, Assistant Full-time white Female 5/23/2016 72749.0 190222 3/29/2022 2016-05-23 72749.0 74785.972 76879.979216
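For anyone adapting the salary band: `Series.between` includes both endpoints by default, so a salary of exactly 55,000 or 75,000 would be kept. Here is a small sketch using four Austin Community College rows from the outputs above; the names `cc_sample`, `low`, and `high` are just for illustration.

```python
import pandas as pd

# Four Austin Community College rows from the outputs above
cc_sample = pd.DataFrame({
    'job_title': ['Professor', 'Professor', 'Professor, Assistant', 'Professor, Assistant'],
    'salary': [142052.0, 88776.0, 73972.0, 72749.0],
})

low, high = 55000, 75000
in_band = cc_sample['salary'].between(low, high)  # inclusive on both ends by default
is_assistant = cc_sample['job_title'].str.contains('Assistant', na=False)

print(cc_sample[is_assistant & in_band].sort_values('salary', ascending=False))
```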

Let’s look at the top salaries for select roles

Here you can see the top 10 ‘Data’-related roles, followed by box plots of salary by race and by gender.

Code
### Describe Data
import seaborn as sns
cmap = sns.light_palette("#34A853", as_cmap=True)


### Change this as you see fit. Suggestions: 'Data', 'Assistant', 'Librar'
### (str.contains treats the pattern as a regex, so 'Librar' already matches Library and Librarian)
Role = 'Data'
num_ = 10

print('Top {} {} Roles: \n\n'.format(num_, Role))
display(uni_[uni_.job_title.str.contains(Role)].sort_values(by=['salary'], ascending=False).head(num_).style.background_gradient(cmap=cmap, subset=['salary']))

print('\n\n','----'*50, '\n\n')

display(uni_[uni_.job_title.str.contains(Role)].groupby(['race', 'gender'])[['salary']].describe().style.background_gradient(cmap=cmap, subset=[('salary',  'mean')]))


print('\n\n','----'*50, '\n\n')
Top 10 Data Roles: 

agency department full_name job_title employment_time race gender hire_date salary id data_date hire_date_dt salary2 salary3 salary4
46963 Texas A&M Texas Real Estate Research Center Gerald A Klassen Research Data Scientist Full-Time white Male 11/28/2005 223485.600000 149201 3/1/2022 2005-11-28 00:00:00 223485.600000 229743.196800 236176.006310
19476 Texas State University Office of Institutional Research Tami Lynn Rice Dir, System Data & Analysis Full-time white Female 12/1/1997 142124.880000 74397 8/20/2021 1997-12-01 00:00:00 142124.880000 146104.376640 150195.299186
79868 Austin Community College Information Technology David Cantu Sr. Manager, Data Governance Full-time hispanic or latino Male 5/17/2021 129273.000000 191311 3/29/2022 2021-05-17 00:00:00 129273.000000 132892.644000 136613.638032
24980 UT Austin IQ - Information Quest Darren S Holm Senior Database Administrator Full-time white Male 11/10/2014 127335.000000 198732 3/31/2022 2014-11-10 00:00:00 127335.000000 130900.380000 134565.590640
42959 Texas A&M Office Of Institutional Effectiveness Rajeeb L Das Senior Data Scientist Full-Time hispanic or latino Male 12/2/2019 126284.160000 147034 3/1/2022 2019-12-02 00:00:00 126284.160000 129820.116480 133455.079741
71763 UT Arlington University Analytics Lisa Creed MANAGER, Partnerships & Data Not Provided not provided Not Provided 9/8/2020 123600.000000 145604 2/15/2022 2020-09-08 00:00:00 123600.000000 127060.800000 130618.502400
32501 UNT Data Analytics & Instl Rsrch Daniel J Hubbard Director, Data Management Full-Time white Male 2/22/2017 123000.000000 179862 3/29/2022 2017-02-22 00:00:00 123000.000000 126444.000000 129984.432000
64495 UT Arlington OIT Enterprise Data Services Paul Savoy Sr Database Administrator Not Provided not provided Not Provided 12/12/2016 121616.000000 141732 2/15/2022 2016-12-12 00:00:00 121616.000000 125021.248000 128521.842944
1803 University of Houston Enterprise Systems Zeandra Mathura ES Database Adminstrator 4 Full-Time asian Female 6/1/2016 116857.800000 120609 2/18/2022 2016-06-01 00:00:00 116857.800000 120129.818400 123493.453315
6885 University of Houston Enterprise Systems Carol Pena ES Database Adminstrator 4 Full-Time hispanic or latino Female 2/11/2008 114880.440000 120659 2/18/2022 2008-02-11 00:00:00 114880.440000 118097.092320 121403.810905


 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 

salary (summary statistics by race and gender)

| race | gender | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| asian | Female | 22.000000 | 79189.720000 | 21915.379994 | 41616.000000 | 60969.120000 | 80713.320000 | 97159.000000 | 116857.800000 |
| asian | Male | 18.000000 | 86061.783889 | 20387.068553 | 52000.000000 | 75340.432500 | 90175.860000 | 102901.000000 | 110560.320000 |
| black or african american | Female | 8.000000 | 66522.936250 | 16022.700192 | 30514.320000 | 63750.000000 | 72318.585000 | 74150.000000 | 81690.000000 |
| black or african american | Male | 3.000000 | 87640.000000 | 19325.716028 | 69770.000000 | 77385.000000 | 85000.000000 | 96575.000000 | 108150.000000 |
| hispanic or latino | Female | 7.000000 | 53637.491429 | 31630.273709 | 24046.000000 | 34779.500000 | 41124.000000 | 62926.500000 | 114880.440000 |
| hispanic or latino | Male | 19.000000 | 70699.188947 | 29236.914059 | 34680.000000 | 50260.020000 | 60000.000000 | 90190.495000 | 129273.000000 |
| native hawaiian or other pacific islander | Female | 1.000000 | 43296.000000 | nan | 43296.000000 | 43296.000000 | 43296.000000 | 43296.000000 | 43296.000000 |
| not provided | Female | 3.000000 | 43594.666667 | 16738.365671 | 32000.000000 | 34000.000000 | 36000.000000 | 49392.000000 | 62784.000000 |
| not provided | Male | 4.000000 | 71264.957500 | 32838.361957 | 42861.000000 | 43677.480000 | 67508.410000 | 95095.887500 | 107182.010000 |
| not provided | Not Provided | 40.000000 | 70903.949500 | 26333.239253 | 26450.000000 | 50600.000000 | 65693.115000 | 93155.000000 | 123600.000000 |
| two or more races | Female | 3.000000 | 61697.586667 | 18196.965722 | 45000.000000 | 52000.000000 | 59000.000000 | 70046.380000 | 81092.760000 |
| two or more races | Male | 2.000000 | 60190.040000 | 11582.465644 | 52000.000000 | 56095.020000 | 60190.040000 | 64285.060000 | 68380.080000 |
| white | Female | 49.000000 | 59937.559388 | 22755.208720 | 24293.000000 | 43703.000000 | 54200.000000 | 71400.000000 | 142124.880000 |
| white | Male | 68.000000 | 74908.674118 | 30083.636671 | 34000.000000 | 51940.500000 | 72003.500000 | 92756.257500 | 223485.600000 |


 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
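A note on reading the summary above: the mean is sensitive to a single very high salary, so the median (the 50% column, and the line drawn inside a box plot) is often the steadier number to compare across groups. Here is a tiny sketch using six rows from the Top 10 table. These are only the top earners, so the numbers illustrate the mean/median gap and say nothing about the full dataset.

```python
import pandas as pd

# Gender and salary for six rows of the "Top 10 Data Roles" table above
top_rows = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'salary': [223485.60, 142124.88, 129273.00, 127335.00, 116857.80, 114880.44],
})

# count, mean, and median side by side; one large salary pulls the mean up far more than the median
summary = top_rows.groupby('gender')['salary'].agg(['count', 'mean', 'median'])
print(summary.round(2))
```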

Code
### Visualize!

boxprops = dict(linewidth=1.5, color='pink')
medianprops = dict(linestyle='-.', linewidth=2.5, color='firebrick')

# Creating boxplot
fig, ax = plt.subplots(figsize =(18, 6))
# Remove top, right, and left borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

uni_[uni_.job_title.str.contains(Role)].boxplot(column='salary', by=['race'], ax=ax, boxprops=boxprops, medianprops=medianprops)
plt.suptitle("Boxplot for {} by Race".format(Role))
plt.show()

Box Plots for a selected role by Race

Code
### Visualize!

boxprops = dict(linewidth=1.5, color='pink')
# green median line for the gender plot (the race plot uses firebrick)
medianprops = dict(linestyle='-.', linewidth=2.5, color='green')


fig, ax = plt.subplots(figsize=(18, 6))
# Remove top, right, and left borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

uni_[uni_.job_title.str.contains(Role)].boxplot(column='salary', by=['gender'], ax=ax, boxprops=boxprops, medianprops=medianprops)
plt.suptitle("Boxplot for {} by Gender".format(Role))
plt.show()

Box Plots for a selected role by Gender
