Structure of the Data: |
Data Size (rough estimation): 361,162 KB (26 columns and more than 1,000,000 rows)
Data Fields of the data and Brief Explanation:
● Crime Code: Indicates the crime committed.
● Crime Code Description: Defines the Crime Code provided.
● Victim Age: Two numeric characters.
● Victim Sex: F – Female, M – Male, X – Unknown.
● Victim Descent: Descent Code:
○ A – Other Asian
○ B – Black
○ C – Chinese
○ D – Cambodian
○ F – Filipino
○ G – Guamanian
○ H – Hispanic/Latin/Mexican
○ I – American Indian/Alaskan Native
○ J – Japanese
○ K – Korean
○ L – Laotian
○ O – Other
○ P – Pacific Islander
○ S – Samoan
○ U – Hawaiian
○ V – Vietnamese
○ W – White
○ X – Unknown
○ Z – Asian Indian
● Premise Code: The type of structure, vehicle, or location where the crime took place.
● Premise Description: Defines the Premise Code provided.
● Weapon Used Code: The type of weapon used in the crime.
● Weapon Description: Defines the Weapon Used Code provided.
● Address: Street address of crime incident rounded to the nearest hundred block to maintain anonymity.
● Cross Street: Cross Street of rounded Address.
● Location: The location where the crime incident occurred. Actual address is omitted for confidentiality. XY coordinates reflect the nearest 100 block.
|
Analysis Procedure You Plan to take: |
We would like to use several Python libraries to solve the problem, such as xlrd, numpy, scipy, pandas, Statsmodels, matplotlib, seaborn, sklearn and folium. The logic of our approach follows the project analysis process outlined in the steps below.
1. Identify Problem:
● Identify Purpose of Database: Find out how frequently crimes occur in different areas.
● Identify Basic Process: Analyze where crimes are most often committed. The analyzed data can help us to
○ identify the seriousness of different crimes;
○ sum up the results; and
○ advise tourists on when and where it is safer to travel.
2. Data Collection:
● Download Database: Download the data source from the website Data.Gov.
● Describe Dataset: The data file contains 26 variables. We extract only the data related to the problem, such as the variables Area (Name, District, Address), crime code, crime code description, victim age and victim sex, which help us to
○ develop the right kind of strategies (e.g., which packages to use) for analysis.
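As a minimal sketch of the extraction step, the snippet below reads only the columns of interest with pandas. The column names and the three sample rows are hypothetical stand-ins; the real header names in the Data.Gov export may differ, and in practice the file path would replace the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical column names; the real header names in the download may differ.
COLUMNS = ["Area Name", "Crime Code", "Crime Code Description",
           "Victim Age", "Victim Sex"]

# Tiny in-memory stand-in for the downloaded CSV (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "Area Name,Crime Code,Crime Code Description,Victim Age,Victim Sex,Premise Code\n"
    "Area 1,100,THEFT,34,M,101\n"
    "Area 2,200,ASSAULT,27,F,108\n"
    "Area 1,300,DAMAGE,,X,502\n"
)

# usecols keeps only the columns needed for the analysis,
# which also reduces memory use on a file with 26 columns.
df = pd.read_csv(sample_csv, usecols=COLUMNS)

print(df.shape)
print(df.head())
```

Restricting the columns at read time, rather than dropping them afterwards, is one way to keep a file of this size manageable in memory.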
3. Data Preparation [numpy/ pandas]:
● Identify the Dependent Variable: Separate the variables into two groups: independent and dependent.
○ For instance, we will try using crime code as the dependent variable, setting a particular crime code to “Yes” and all others to “No”. This variable will then be used with different packages to build better models.
● Handling Missing Values: Data files are often incomplete, so some variables contain null values. Use the pandas package to create a DataFrame and drop or fill the empty entries.
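The preparation step could look roughly like the sketch below. The rows are synthetic, and the column names, the chosen crime code (`TARGET_CODE`) and the fill strategies (median age, "X" for unknown sex) are assumptions for illustration, not decisions stated in the plan.

```python
import numpy as np
import pandas as pd

# Synthetic rows; column names and values are hypothetical.
df = pd.DataFrame({
    "Crime Code": [100, 200, 100, 300, 200],
    "Victim Age": [34, np.nan, 27, 45, np.nan],
    "Victim Sex": ["M", "F", None, "F", "M"],
})

TARGET_CODE = 100  # assumed crime code of interest

# Dependent variable: "Yes" for the chosen crime code, "No" for all others.
df["Is Target Crime"] = np.where(df["Crime Code"] == TARGET_CODE, "Yes", "No")

# Option 1: fill the gaps. Median for the numeric age,
# and the documented "X" (Unknown) code for a missing sex.
filled = df.copy()
filled["Victim Age"] = filled["Victim Age"].fillna(filled["Victim Age"].median())
filled["Victim Sex"] = filled["Victim Sex"].fillna("X")

# Option 2: drop every row that has at least one missing value.
dropped = df.dropna()

print(filled)
print(len(dropped))
```

Whether to fill or drop depends on how many rows are affected; dropping is simpler but can discard a large share of the data when nulls are common.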
4. Initial Look (Simple Chart): Data Manipulation & Look
4.1 Plot/Histogram [pandas]:
– Take the date occurred and separate it into time periods (e.g., 08:00-12:00) or months
– Count the number of crime occurrences (pd.to_numeric)
– Create a DataFrame
– Take a look at the data using df.head/tail/shape
– Plot a histogram (hist) or build a pivot table to show the crimes that occurred in each time period
– Investigate which time periods or months are the riskiest
4.2 Statistical Analysis [numpy, Statsmodels]:
– Set Y = crime code as the dependent variable
– Find the p-values to see which variables are significant; e.g., if X = victim’s gender or age is significant, it is considered related to the crime code.
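The two sub-steps of section 4 can be sketched as follows, under several assumptions: the time is stored as an HHMM string, the column names are hypothetical, and the data is synthetic. For the significance check, a chi-square test of independence from scipy is used here as a simple stand-in; a regression in Statsmodels, as named in 4.2, would be an alternative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic records; column names and the HHMM time format are assumptions.
df = pd.DataFrame({
    "Time Occurred": ["0130", "0915", "1005", "1430", "2210", "2345", "0850", "1620"],
    "Victim Sex": ["M", "F", "F", "M", "M", "F", "M", "F"],
    "Is Target Crime": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Convert the HHMM string to an hour of day (0-23).
hour = pd.to_numeric(df["Time Occurred"]) // 100

# Bin the hours into six four-hour periods; right=False makes each bin [start, end).
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["00:00-04:00", "04:00-08:00", "08:00-12:00",
          "12:00-16:00", "16:00-20:00", "20:00-24:00"]
df["Period"] = pd.cut(hour, bins=edges, labels=labels, right=False)

# Count crimes per period, ordered chronologically by period.
counts = df["Period"].value_counts().sort_index()
print(counts)
# counts.plot(kind="bar") would draw the chart (requires matplotlib).

# Chi-square test of independence between victim sex and the binary crime label.
# With only eight rows the result is not meaningful; it only shows the mechanics.
table = pd.crosstab(df["Victim Sex"], df["Is Target Crime"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.3f}")
```

A small p-value would suggest that the variable and the crime label are related; with the full dataset of over a million rows, even weak associations are likely to appear significant, so effect size is worth checking as well.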
5. Modeling [sklearn]
– 5.1. Data Compilation: Find the probability of each crime (according to crime code)
– 5.2. Training a Model on the Data: Split the dataset into training data and testing data
– 5.3. Feature Generation or Selection: Identify the variables selected for the subsequent modeling
– 5.4. Model Design and Selection: Select the algorithm type (supervised or unsupervised learning) and the modeling method (classification, clustering, regression or dimensionality reduction)
– For example, we would like to separate the crimes into 4 major types: A, homicide; B, wounding assault; C, criminal damage; D, theft. Predicting these predefined types is a classification task, which is supervised learning; clustering is the unsupervised alternative for grouping crimes without predefined labels.
– 5.5. Parameter Tuning: Tune the model parameters and remove some outliers
– 5.6. Evaluation: Choose the best model for future predictions (by comparing the results, relevance and explainability of each model)
6. Reporting & Visualization [matplotlib/ sklearn / folium]
– Time Series: Use matplotlib to display a line graph showing the relationship between time and the crimes that occurred
– Date Series: Use matplotlib to display bar charts showing the relationship between date and the crimes that occurred
– Location: Use folium to create a heat map of the locations where the crimes happened.
– Clustering: Use matplotlib to display the results from sklearn, showing the relationship between the dependent variable (the 4 major types of crime) and the other selected independent variables.
|
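One way to sketch steps 5.2 to 5.6 is shown below, assuming the four crime types (A to D) are used as labels for a supervised classifier. The decision tree, the feature columns and the random synthetic data are illustrative choices, not the model the project will necessarily use.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: random features and random labels for the four crime types.
# Column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Victim Age": rng.integers(18, 80, size=n),
    "Hour": rng.integers(0, 24, size=n),
    "Victim Sex": rng.choice(["F", "M", "X"], size=n),
    "Crime Type": rng.choice(["A", "B", "C", "D"], size=n),
})

# 5.3 Feature selection: pick the predictors and one-hot encode the categorical one.
X = pd.get_dummies(df[["Victim Age", "Hour", "Victim Sex"]], columns=["Victim Sex"])
y = df["Crime Type"]

# 5.2 Split into training and testing data, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 5.4 Model selection: a shallow decision tree as an example classifier.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 5.6 Evaluation: accuracy on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Because the labels here are random, the accuracy should sit near chance level (about 0.25); on the real data, comparing several models on the same test split would support the evaluation in 5.6.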