Chapter 3 Data
3.1 Sources
3.1.1 Museum Data Source
The museum data comes from the official website of the Institute of Museum and Library Services (IMLS). We use the latest version, which was collected in 2018. It consists of three CSV files, each covering different types of museums: the second file contains only GMU museums and the third file contains only HSC museums.
The three files contain 7,431, 7,961, and 14,786 museums respectively (30,178 in total). Each file has 58 variables, including basic information such as the ID, location, contact details, and rule date, as well as income and revenue information.
We mainly analyse the state, discipline, income, and established date, so we select only the relevant columns. The discipline variable takes 9 values, one for each kind of museum. We use IncomeCD15 to represent museum income; it codes income into 9 ranges, from 0 up to 50,000,000 and greater. The rule_date is the date on which the registered nonprofit organization was recognized by the IRS as having obtained formal tax-exempt status; we use this column as the established date.
3.1.2 Library Data Source
The library data set also comes from https://www.imls.gov/, the website of the Institute of Museum and Library Services. The institute provides several data sets on museums and libraries across the nation. We use the Public Libraries Survey (PLS), which is supplied annually by public libraries across the country and describes when, where, and how library services are changing to meet the needs of the public.
The data is collected by the institute. The survey is administered by State Data Coordinators, who collect the requested data from local public libraries and report it to the institute through a web-based reporting system.
The survey has been conducted annually since 1988, and the website provides data files from 1992 onward. Over the years, the content of the survey has changed 3 times, and the formats of the data sets differ slightly.
The data files are available on the website as a public-use data set, which means that some data has been removed to protect the confidentiality of individually identifiable survey respondents. Our project uses this public-use data set, which is already encoded and pre-processed and comes with a documentation file that gives detailed information about the data and helps users interpret it correctly.
The data set provides statistics on the status of public libraries in the United States, collected from approximately 9,000 public libraries with approximately 17,000 individual public library outlets in the 50 states, the District of Columbia, and outlying territories. It contains information about library visits, circulation, size of collections, public service hours, staffing, electronic resources, operating revenues and expenditures, and number of service outlets. We mainly use the state-level data, which contains 55 rows and more than 100 variables.
3.2 Cleaning / transformation
3.2.1 Museum Data Cleaning
We first select the columns that we need from the three files, then convert the ID from string to numeric so that its type matches across files, and finally merge the files together.
We then create separate data frames grouped by state, discipline, income, and decade of the rule date. These show the distribution of museums across disciplines and states, the relationships between variables, and the growth in the number of museums from the 1900s to the 2010s.
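As a rough illustration of the select, convert, and merge steps, the sketch below uses Python with only the standard library. It is not the code used in this project. The column names are taken from the missing-value summary in Section 3.3.1; the miniature files, their values, the extra PHONE column, and the helper names `load_selected` and `merge_files` are made up for the example, and "merge" is assumed here to mean stacking the rows of the three files.

```python
import csv
import io

# Column names taken from the missing-value summary in Section 3.3.1.
KEEP = ["MID", "DISCIPLINE", "ADSTATE", "INCOMECD15", "RULEDATE15"]


def load_selected(handle, keep=KEEP):
    """Read one CSV file, keep only the chosen columns, and make MID numeric."""
    rows = []
    for record in csv.DictReader(handle):
        row = {name: record.get(name, "") for name in keep}
        mid = (row["MID"] or "").strip()
        # Empty or non-numeric IDs become None so they can be dropped later.
        row["MID"] = int(mid) if mid.isdigit() else None
        rows.append(row)
    return rows


def merge_files(handles):
    """Stack the selected rows of several files into one list (assumed meaning of 'merge')."""
    merged = []
    for handle in handles:
        merged.extend(load_selected(handle))
    return merged


# Made-up miniature files standing in for the three real CSVs.
HEADER = "MID,DISCIPLINE,ADSTATE,INCOMECD15,RULEDATE15,PHONE\n"
file_1 = io.StringIO(HEADER + "1001,ART,NY,3,199205,555-0100\n")
file_2 = io.StringIO(HEADER + "1002,GMU,CA,,,555-0101\n")
file_3 = io.StringIO(HEADER + "1003,HSC,TX,1,200101,555-0102\n")

museums = merge_files([file_1, file_2, file_3])
print(len(museums), museums[0])
```

Converting the ID before stacking keeps a single type in the merged column, which is the reason the conversion comes first in the description above.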
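The decade grouping can be sketched as follows, again in plain Python rather than the project's own code. The sketch assumes that the rule date is stored as a year-month value whose first four digits are the year (for example 199205); this format, the sample values, and the helper name `decade_of` are assumptions for illustration only.

```python
from collections import Counter


def decade_of(rule_date):
    """Return a decade label such as '1990s', or None if the year cannot be read."""
    text = str(rule_date).strip()
    if len(text) < 4 or not text[:4].isdigit():
        return None
    year = int(text[:4])
    return f"{year // 10 * 10}s"


# Made-up rule dates; the empty string stands for a missing value.
sample_dates = ["199205", "198711", "", "201403", "199912"]

labels = [decade_of(value) for value in sample_dates]
decade_counts = Counter(label for label in labels if label is not None)
print(decade_counts)
```

Records without a readable year are left out of the count rather than assigned to a decade.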
3.2.2 Library Data Cleaning
The library data set is already quite clean and suitable for our use. We therefore did little cleaning beyond going through the variables and selecting the ones we are interested in. The state information in the data set is coded, so we combined it with another data set that contains the name and the code of each state to obtain the state names we need.
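A minimal sketch of this lookup is shown below, in Python and not taken from the project. It assumes the code is the two-letter STABR column that appears in Section 3.3.2; the lookup table is a hand-written excerpt, and the column name STATE_NAME and the helper `add_state_names` are hypothetical.

```python
# Hand-written excerpt of a code-to-name table (the real one covers every row).
STATE_NAMES = {
    "GU": "Guam",
    "MP": "Northern Mariana Islands",
    "VI": "Virgin Islands",
}


def add_state_names(rows, lookup):
    """Return copies of the rows with a STATE_NAME field looked up from STABR."""
    named_rows = []
    for row in rows:
        named = dict(row)
        # Codes that are absent from the lookup get None instead of raising.
        named["STATE_NAME"] = lookup.get(row["STABR"])
        named_rows.append(named)
    return named_rows


library_rows = [{"STABR": "GU", "YEAR": 2020}, {"STABR": "ZZ", "YEAR": 2020}]
named_rows = add_state_names(library_rows, STATE_NAMES)
print(named_rows)
```

Returning None for an unknown code makes unmatched rows easy to spot after the join.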
3.3 Missing value analysis
3.3.1 Museum Data
There are no missing values in discipline and state, but there are 8,677 (28.8%) empty values in incomecd and 8,682 (28.8%) empty values in rule date.
Most incomplete records are missing rule date and incomecd together, which suggests that missingness in these two variables is correlated. If MID is missing, we treat the record as an invalid museum record and delete it.
##        MID DISCIPLINE    ADSTATE INCOMECD15 RULEDATE15
##          3          0          0       8677       8682
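Counts like the ones above, and the number of records missing both variables at once, could be computed along the following lines. This is a sketch in Python with made-up records, not the project's code; the helper names are hypothetical, and treating an empty string or None as missing is an assumption.

```python
def is_missing(value):
    """Treat None and empty strings as missing (an assumption for this sketch)."""
    return value is None or str(value).strip() == ""


def missing_counts(rows, columns):
    """Count missing values per column."""
    return {name: sum(is_missing(row.get(name)) for row in rows) for name in columns}


def missing_together(rows, first, second):
    """Count rows in which both columns are missing at the same time."""
    return sum(is_missing(row.get(first)) and is_missing(row.get(second)) for row in rows)


COLUMNS = ["MID", "DISCIPLINE", "ADSTATE", "INCOMECD15", "RULEDATE15"]

# Made-up records for illustration only.
records = [
    {"MID": 1001, "DISCIPLINE": "ART", "ADSTATE": "NY", "INCOMECD15": "3", "RULEDATE15": "199205"},
    {"MID": 1002, "DISCIPLINE": "HSC", "ADSTATE": "CA", "INCOMECD15": "", "RULEDATE15": ""},
    {"MID": None, "DISCIPLINE": "GMU", "ADSTATE": "TX", "INCOMECD15": "", "RULEDATE15": "200101"},
]

counts = missing_counts(records, COLUMNS)
both = missing_together(records, "INCOMECD15", "RULEDATE15")
# Records without an ID are treated as invalid and dropped.
valid_records = [row for row in records if not is_missing(row["MID"])]
print(counts, both, len(valid_records))
```

Comparing the joint count with the two per-column counts is one simple way to check whether the two variables tend to be missing together.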
After removing the missing data, the remaining data is still representative: plotting the remaining records by their pair of location variables still reproduces the shape of the map of the USA. This means that the data collection covers every state and that the amount of missing data is acceptable, because the remaining data keeps the key geographic and discipline information. Hence, it is valid to explore the distribution of disciplines and locations in the following sections using the remaining data.
3.3.2 Library Data
We first visualize the missing data in our library data set and discuss its pattern. This is an important step in understanding our data.
- Library data of 2020
In this project, we first analyze the 2020 data for libraries in the US. According to the documentation of the data set, all missing data is filled with ‘-1’. Thus, we visualize the distribution of ‘-1’ in this data set.
As we can see from the graph, the only row with missing data is the one with code ‘VI’, the Virgin Islands. All other rows are complete. Because the Virgin Islands is an outlying area of the United States and the missing values are mainly statistics about library revenue and resources, we believe it is reasonable to keep ‘-1’ in these rows; they have little effect on our visualization.
This graph shows the pattern further: most rows (the data of the different states) are complete, which gives us relatively good data quality for analyzing and visualizing libraries in the US.
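The scan for ‘-1’ values can be sketched as below. The three rows are the 2020 rows for GU, MP, and VI from the table in this section, restricted to the columns shown there; the code is an illustrative Python sketch, not the project's own, and the helper name `missing_by_row` is hypothetical.

```python
MISSING = "-1"  # sentinel used by the data set for missing values

FIELDS = ["STABR", "POPU_LSA", "CENTLIB", "BRANLIB", "BKMOB", "BKVOL", "EBOOK", "VISITS"]

# 2020 rows for the three outlying areas, copied from the table in this section.
RAW_2020 = [
    ["GU", "168678", "1", "5", "0", "286930", "0", "35115"],
    ["MP", "53883", "1", "2", "1", "83624", "5500", "44710"],
    ["VI", "106405", "0", "5", "3", "176762", "-1", "-1"],
]
rows_2020 = [dict(zip(FIELDS, values)) for values in RAW_2020]


def missing_by_row(rows, key="STABR"):
    """Map each row's key to the list of columns holding the missing sentinel."""
    return {
        row[key]: [name for name, value in row.items() if name != key and value == MISSING]
        for row in rows
    }


# Keep only the rows that actually have something missing.
flagged = {code: cols for code, cols in missing_by_row(rows_2020).items() if cols}
print(flagged)
```

On these three rows and columns, only VI is flagged, which matches the pattern described above.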
- Time series missing data analysis
Moreover, to discuss how libraries in the US change over time, we collect library data from different years. Although we could simply read in all the raw data, bind it together, and select the variables we are interested in, storing such a large set of data would make the space requirement of this project too large, and the transformation process would take a lot of time. Instead, we select the variables from each year’s data and store them in a single csv file. As the data is updated once a year, adding new data to this file does not take much time. We keep the data format the same as in the original data, with all missing data filled with ‘-1’.
As we can see from the graph, three areas have missing data over the years: GU (Guam), MP (Northern Mariana Islands), and VI (Virgin Islands). All of them are outlying areas of the United States. Thus, we keep these data in their original form.
As we can see in this chart, among the rows with missing data, some are missing a great deal of data while others are missing only one or two values. As a result, we look at the missing data in more detail. We select all the rows of the three areas; the data is shown below:
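One way to build such a combined file is sketched below in Python. It is an illustration under assumptions, not the project's code: the selected variables are a small hypothetical subset, an in-memory buffer stands in for the csv file on disk, and filling a variable that is absent in a given year with ‘-1’ is an assumption made to match the data set's convention. The numeric values are taken from the GU 2020 and VI 2019 rows of the table in this section.

```python
import csv
import io

SELECTED = ["STABR", "POPU_LSA", "VISITS"]  # hypothetical subset of variables
FIELDNAMES = ["STABR", "YEAR", "POPU_LSA", "VISITS"]


def append_year(writer, year, yearly_rows, selected=SELECTED):
    """Write the selected variables of one year's rows, tagged with the year."""
    for row in yearly_rows:
        out = {"YEAR": year}
        for name in selected:
            # Assumption: a variable absent from this year's file is stored as '-1'.
            out[name] = row.get(name, "-1")
        writer.writerow(out)


combined = io.StringIO()  # stands in for the combined csv file
writer = csv.DictWriter(combined, fieldnames=FIELDNAMES)
writer.writeheader()

append_year(writer, 2020, [{"STABR": "GU", "POPU_LSA": "168678", "VISITS": "35115", "OTHER": "x"}])
append_year(writer, 2019, [{"STABR": "VI", "POPU_LSA": "106405"}])

lines = combined.getvalue().splitlines()
print(lines)
```

Adding a new survey year then amounts to one more `append_year` call instead of reloading every raw file.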
##    STABR YEAR POPU_LSA CENTLIB BRANLIB BKMOB  BKVOL EBOOK VISITS
##  1    GU 2020   168678       1       5     0 286930     0  35115
##  2    MP 2020    53883       1       2     1  83624  5500  44710
##  3    VI 2020   106405       0       5     3 176762    -1     -1
##  4    GU 2019   168678       1       5     0 278963     0 100095
##  5    MP 2019    51433       1       2     1  82199  5000 147983
##  6    VI 2019   106405       0       5     3 176762    -1     -1
##  7    GU 2018   167358       1       5     0 276631     0  75119
##  8    MP 2018    51994       1       2     1  92000  5000 126778
##  9    VI 2018   106405       0       4     3     -1    -1     -1
## 10    GU 2017   164229       1       5     0 272720   200  81572
## 11    MP 2017    53883       1       2     0  92141  5000 104224
## 12    GU 2016   159358       1       5     0 269270   200  71813
## 13    GU 2015   159358       1       5     0 266695   200  72223
## 14    GU 2014   159358       1       5     0 263486  1335 103593
## 15    MP 2014    53883       1       1     0     -1    -1     -1
## 16    VI 2014   106405       0       5     1     -1    -1     -1
## 17    GU 2013   159358       1       5     1 260586     0  98969
## 18    MP 2013    53883       1       1     0     -1    -1     -1
## 19    VI 2013   106405       0       5     1     -1    -1     -1
## 20    GU 2012   159358       1       5     1 258241  1278  75472
## 21    MP 2012    53883       1       1     0     -1    -1     -1
## 22    VI 2012   106405       0       5     1     -1    -1     -1
## 23    GU 2011   159358       1       5     1 255277     0  80016
## 24    MP 2011    53883       1       1     0     -1    -1     -1
## 25    VI 2011   106405       0       5     1     -1    -1     -1
## 26    GU 2010   180692       1       5     1 204503     0  84019
## 27    MP 2010    53883       1       1     0     -1    -1     -1
## 28    VI 2010   106405       0       5     1     -1    -1     -1
## 29    GU 2009   175459       1       5     1 210079     0  60763
## 30    MP 2009       -1      -1      -1    -1     -1    -1     -1
## 31    VI 2009       -1      -1      -1    -1     -1    -1     -1
## 32    GU 2008    25984       1       5     1 211772    -1  70061
## 33    MP 2008       -1      -1      -1    -1     -1    -1     -1
## 34    VI 2008       -1      -1      -1    -1     -1    -1     -1
## 35    GU 2007       -1      -1      -1    -1     -1    -1     -1
## 36    MP 2007       -1      -1      -1    -1     -1    -1     -1
## 37    VI 2007       -1      -1      -1    -1     -1    -1     -1
## 38    GU 2006       -1      -1      -1    -1     -1    -1     -1
## 39    MP 2006       -1      -1      -1    -1     -1    -1     -1
## 40    VI 2006       -1      -1      -1    -1     -1    -1     -1
From the table, we can see that in some years (2015 and 2016) two of the areas, MP and VI, did not report any data and have no row at all; VI also has no row for 2017. Among the reported data, most of the missing values are in the early years. Because we want to see the trend in the data, we keep the original values in this data set instead of filling them in some way.
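The two observations above can be checked with a short Python sketch. The years per area and the MP BKVOL values are read off the table; everything else, including the helper names `absent_years` and `mask_sentinel`, is hypothetical and not part of the project's code. Masking ‘-1’ only at the moment a series is plotted or summarized is one possible way to keep the stored file in its original form while preventing the sentinel from being read as a real value.

```python
FIRST_YEAR, LAST_YEAR = 2006, 2020

# Years for which each area has a row in the table above.
reported_years = {
    "GU": list(range(2006, 2021)),
    "MP": list(range(2006, 2015)) + [2017, 2018, 2019, 2020],
    "VI": list(range(2006, 2015)) + [2018, 2019, 2020],
}


def absent_years(reported, first=FIRST_YEAR, last=LAST_YEAR):
    """List, per area, the years in the range that have no row at all."""
    result = {}
    for area, years in reported.items():
        present = set(years)
        result[area] = [year for year in range(first, last + 1) if year not in present]
    return result


def mask_sentinel(series, sentinel=-1):
    """Replace the missing sentinel with None without touching the stored data."""
    return [(year, None if value == sentinel else value) for year, value in series]


absent = absent_years(reported_years)

# BKVOL for MP in selected years, copied from the table above.
mp_bkvol = [(2014, -1), (2017, 92141), (2018, 92000), (2019, 82199), (2020, 83624)]
masked = mask_sentinel(mp_bkvol)
print(absent, masked)
```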