Python analysis of Welsh crimes and punishment

This is designed as a follow-on activity to our Introducing Python for Data Science activity.

In this series of challenges, you'll be working with a real historical dataset: Welsh crime and punishment records from 1730 to 1830, kindly provided by The National Library of Wales. These records offer a fascinating glimpse into everyday life, social structures, and the justice system of the time - and you'll be using Python to uncover patterns hidden within them.

The original archive includes a wide range of offences, described in the language of the period. For this activity, the dataset has been carefully cleaned so that you can explore it safely. Violent, graphic, sexual, and other sensitive offences have been removed, leaving behind non-violent property and economic crimes, nuisance cases, and some wonderfully quirky misdemeanours.

You will also notice that some entries include the death sentence. This has not been removed, because it was a standard part of the legal system during this period, known as the "Bloody Code". The term refers to a set of laws that made many non-violent crimes punishable by death. Although the sentence appears in the records, it does not include graphic detail, and it is included here to help you understand the historical context of the data.

Every effort has been made to make the dataset appropriate for you to work with. However, historical records can be inconsistent or unexpectedly phrased, so there's always a small chance that something borderline may have slipped through. If you come across anything that feels unsuitable or uncomfortable, you can simply skip that entry - and you're very welcome to let us know at outreach@aber.ac.uk so we can remove it from future versions.

With that in mind, you're ready to start exploring how Python can help you analyse real historical data and discover the stories it contains.

We've put together a set of starter questions to help you build confidence and get used to working with the dataset. Once you're comfortable, you'll have the freedom to explore your own ideas, interests, and questions using the data.

Click on 'Next' to begin.

Getting started

Step 1:

First you will need a copy of the dataset:

Create a new folder to save this into. Then, in the same folder create a Python file using your chosen editor. All examples, work-throughs, and answers presented will match the Thonny editor's format. Colour schemes vary between editors.

Step 2:

Import your csv and set up your records list in Python. You can choose your own variable names or use our set-up shown below. Treat all the data as strings - in other words, do not convert any of the values to integers like we did in our previous activity.

                    
    import csv
    
    crime_records = []
    
    with open("Crime_Punishment_Clean.csv", newline="") as file:    
       reader = csv.reader(file)
       next(reader)
    
       for row in reader: 
          row_id = row[0]
          file_no = row[1]
          doc_no = row[2]
          f_name = row[3]
          surname = row[4]
          sex = row[5]
          alias = row[6]
          parish = row[7]
          county = row[8]
          status = row[9]
          crime_id = row[10]
          day = row[11]
          month = row[12]
          year = row[13]
          parish_of_crime = row[14]
          county_of_crime = row[15]
          accusor = row[16]
          plea = row[17]
          verdict = row[18]
          punishment = row[19]
          testimonial_1 = row[20]
          testimonial_2 = row[21]
          testimonial_3 = row[22]
          testimonial_4 = row[23]
          testimonial_5 = row[24]
    
          crime_records.append((row_id, file_no, doc_no, f_name, surname, sex, alias, parish, county, status, crime_id, day, month, year, parish_of_crime, county_of_crime, accusor, plea, verdict, punishment, testimonial_1, testimonial_2, testimonial_3, testimonial_4, testimonial_5))
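
If you would like to see how csv.reader and next(reader) behave before loading the full dataset, here is a minimal sketch that reads a tiny, made-up table held in memory. The column titles and rows are invented for illustration and are not taken from the real file; io.StringIO simply stands in for the opened CSV file.

```python
import csv
import io

# Made-up sample data standing in for the real CSV file (hypothetical rows)
sample_text = "id,first_name,surname\n1,Ann,Jones\n2,Evan,Rees\n"

sample_records = []

with io.StringIO(sample_text) as file:
    reader = csv.reader(file)
    header = next(reader)  # reads the first row (the column titles) so the loop skips it

    for row in reader:
        # every value arrives as a string, including the id
        sample_records.append((row[0], row[1], row[2]))

print(header)
print(len(sample_records), "records loaded")
print(sample_records[0])
```

Printing the number of records and the first record in the same way is a quick check that your own import has worked before you start the exercises.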

Searching for data

Using for-loops with if-statements to find specific information in our dataset.

In case you skipped straight to the exercises - these are testing knowledge covered in our Introducing Python activity.

We have included the option of viewing hints, walk-throughs, and/or answers.

Exercise 1

How many alleged crimes involved the theft of nutmeg?

Hints:

This involves searching for 'nutmeg' in the column labelled testun_1 (testimonial_1).

You will need to create a for-loop to go through all the records.

Produce an if-statement looking for nutmeg with and without a capital N to be thorough.

A counter variable will allow you to answer this question.

You may want to print these records to double check the details regarding the crime - there may be some records which do not fit the context of the question.

Walk-through

We need to identify which column needs searching - in this case it is the one labelled as testun_1 in the CSV. This is item number 20 in our Python list (remember: computer programs start counting at 0).
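
As a quick reminder of how counting from 0 works, here is a small sketch using a shortened, made-up record (the real records have 25 items, numbered 0 to 24).

```python
# A shortened, made-up record used only to illustrate item numbers
record = ("17", "Ann", "Jones", "F")

print(record[0])    # the first item is item number 0
print(record[3])    # the fourth item is item number 3
print(len(record))  # the number of items in the record
```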

Now we will need to write a for loop to look through all the records. The below example uses the variable of crime_records for our imported data.

                                
        for record in crime_records:

We now need an if-statement to search for the word 'nutmeg' or, to be thorough, 'Nutmeg'.

                                
        for record in crime_records:
            if "nutmeg" in record[20] or "Nutmeg" in record[20]:

We can include a counter. This means creating a counter variable set to 0 before the for-loop and then adding one each time the if-statement triggers. A print instruction is then needed to tell us the result.

                                
        nutmeg_counter = 0

        for record in crime_records:
            if "nutmeg" in record[20] or "Nutmeg" in record[20]:
                nutmeg_counter += 1

        print("Total references to nutmeg:", nutmeg_counter)

The answer printed in the shell should read as Total references to nutmeg: 7

We do not know what each reference to nutmeg is about, so it is good practice to read them ourselves. To do this, we add a line that prints each matching record; in this example we've also printed the count to mark the start of each record. We can remove this line afterwards (or comment it out - hiding it by putting a # before the instruction).

                                
        nutmeg_counter = 0

        for record in crime_records:
            if "nutmeg" in record[20] or "Nutmeg" in record[20]:
                nutmeg_counter += 1
                print(nutmeg_counter, record[20])

        print("Total references to nutmeg:", nutmeg_counter)

Reviewing these results helps us to recognise thefts of nutmeg vs thefts of nutmeg graters.

Answer:

There are 7 references to nutmeg within the testimonials. Four of these concern the theft of nutmeg graters, so the answer is 3.
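
An alternative to checking for both 'nutmeg' and 'Nutmeg' is to convert each testimonial to lower case before searching. This is only a sketch of that idea, using made-up testimonial texts rather than the real records; .lower() is a standard Python string method.

```python
# Made-up testimonial texts used only to illustrate the idea
sample_testimonials = [
    "Theft of Nutmeg from a shop",
    "Theft of a nutmeg grater",
    "Theft of a sheep",
]

nutmeg_counter = 0

for text in sample_testimonials:
    # .lower() returns a lower-case copy, so "Nutmeg" and "nutmeg" both match
    if "nutmeg" in text.lower():
        nutmeg_counter += 1

print("Total references to nutmeg:", nutmeg_counter)
```

This would also catch spellings such as 'NUTMEG', which the two-part check would miss. You would still need to read the matches to separate nutmeg from nutmeg graters.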

Exercise 2

Can you determine the crime allocated the rhif_trosedd (crime_id) of 2050?

Hints

This involves searching for '2050' in the column labelled rhif_trosedd (crime_id).

We will then need to print the crime descriptions in the column labelled testun_1 (testimonial_1).

Use the results to determine the crime.

Walk-through

We already have a 'for loop' going through the records from exercise 1, so we do not need to create a new one - we just add a new 'if statement' inside it.

We will need to add the following if-statement to find the relevant crime id records:

                                
        if "2050" in record[10]:

We then need to print the first testimonial column contents for each occurrence.

                                
        if "2050" in record[10]:
            print("Crime listed against 2050:", record[20])

This is how our code should look now:

                                
        nutmeg_counter = 0

        for record in crime_records:
            if "nutmeg" in record[20] or "Nutmeg" in record[20]:
                nutmeg_counter += 1
                #print(nutmeg_counter, record[20])
            if "2050" in record[10]:
                print("Crime listed against 2050:", record[20])
            

        print("Total references to nutmeg:", nutmeg_counter)

Read the printed messages to work out the crime associated with the id number of 2050.

Answer

Crime 2050 refers to three cases of illegal sheep shearing - shearing a sheep that does not belong to you and stealing the wool.
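
The walk-through uses 'in', which checks whether "2050" appears anywhere inside the crime id. The sketch below, using made-up id values, shows how that differs from == which only matches when the whole value is "2050". We have not checked whether the real dataset contains ids where the two would disagree, so treat this as something to be aware of rather than a correction.

```python
# Made-up crime ids, stored as strings like the values in our records
sample_ids = ["2050", "12050", "20501", "1200"]

contains_matches = []
exact_matches = []

for crime_id in sample_ids:
    if "2050" in crime_id:   # True whenever "2050" appears anywhere in the text
        contains_matches.append(crime_id)
    if crime_id == "2050":   # True only when the whole value is "2050"
        exact_matches.append(crime_id)

print(contains_matches)
print(exact_matches)
```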

Exercise 3

What is the full name of the woman alleged to have used a 'pickle' as a weapon?

Hints

You will need to search for the word pickle in the column titled testun_1 (testimonial_1).

Then, you will need to print the first name (in the enw_cyntaf column) and the surname (in the cyfenw column).

Remember: As you build up your code, make sure your print lines have enough detail to pick them out and/or 'comment out' the lines that are not relevant using a hash symbol at the beginning of the line.

Walk-through

If you want to comment out the previous exercise, you will need to comment out the whole 'if statement', including the lines inside it, as an 'if statement' cannot be left empty of active code.

It is also very good practice to document your code. This means adding additional comments (using the hash symbols) to your code to allow you and others to understand what each section does. Here is an example of how we've documented the code so far:

                                
        nutmeg_counter = 0

        for record in crime_records:
            #Exercise 1 -----
            #if "nutmeg" in record[20] or "Nutmeg" in record[20]:
                #nutmeg_counter += 1
                #print(nutmeg_counter, record[20])
            #-----
            #Exercise 2 -----
            #if "2050" in record[10]:
                #print("Crime listed against 2050:", record[20])
            #-----
            pass #placeholder so the loop is not empty - remove it once you add the next 'if statement'

        #Exercise 1 -----
        #print("Total references to nutmeg:", nutmeg_counter)
        #-----

You will need a new 'if statement' inside the for loop to look for the word pickle in the column labelled testun_1

                                
        if "pickle" in record[20]:

Using a print instruction, you can have your program show you the first name and surname from the relevant columns.

                                
        if "pickle" in record[20]:
            print("Full name of pickle wielder:", record[3], record[4])

This should print the answer in your shell/terminal when the program is run.

Answer

The name of the woman who allegedly used a pickle to maim a horse is: Jane Richards.

Exercise 4

What was the crime for which someone escaped the death penalty with a King's special pardon?

Hints

There is only one record with the exact phrasing of 'King's special pardon' within the column titled cosb (sentence).

As with previous exercises, the crime can be identified by printing the column entitled testun_1 (testimonial_1).

Remember: To keep your code tidy and easy to refer back to, document it as you go.

Walk-through

Let's add a new if statement to our for loop to look for the phrase "King's special pardon" in the cosb (sentence) column.

                                
        if "King's special pardon" in record[19]:

Now, print out the contents of the testun_1 (testimonial_1) column for this crime.

                                
        if "King's special pardon" in record[19]: 
            print("The King's special pardon was given to the person found guilty of:", record[20])

When you run the program, you should now see the crime in your shell/terminal window.

Answer

The King's special pardon was given to a person found guilty of the crime of 'Coining' - the counterfeiting of coins.

Exercise 5

There are a number of cases where someone is accused of the 'Rescue' of someone. How many people were rescued from gaol - the old English spelling of jail?

Hints

You will need to work out the keywords to search for in the testun_1 column.

Then, you will need to go through each record to confirm the number of people rescued - there may be some repeats due to multiple people being involved.

Walk-through

First, you will need to identify the keywords we need to search for in the testun_1 column. These are: 'Rescue' and 'gaol'.

You will need an 'if statement' that looks for both of these words:

                                
        if "Rescue" in record[20] and "gaol" in record[20]:

You will then need to print the full details from the testun_1 column for these occurrences.

                                
        if "Rescue" in record[20] and "gaol" in record[20]:
            print("Occurrence of jail-break:", record[20])

Read through the information this program provides in your shell/terminal to identify the number of people rescued/escaped from jail.

Answer

There are 6 alleged crimes of rescuing/escaping jail. However, due to multiple accused for some of these crimes, the testimonial shows only three different people were broken out of jail.
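
Once you have read the testimonials and noted who was rescued in each one, Python can help remove the repeats. The sketch below assumes you have typed the names into a list yourself; the names shown are made up and are not the ones in the dataset.

```python
# Made-up names of rescued people, with one repeat (hypothetical data)
rescued_names = ["Thomas Price", "Mary Lloyd", "Thomas Price"]

unique_names = set(rescued_names)  # a set keeps only one copy of each value

print(len(rescued_names), "accusations")
print(len(unique_names), "different people")
```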

Exercise 6

What are the full names of the owners of forty hounds that triggered a nuisance complaint?

Hints

This question has the same structure and involves searching the same columns as the one involving a pickle.

Walk-through

Create an 'if statement' to look for the phrase 'forty hounds' in the testun_1 column.

                                
        if "forty hounds" in record[20]:

Print out the contents of the first name and surname columns to give us the full name of the accused.

                                
        if "forty hounds" in record[20]:
            print("Full name of one owner of forty hounds:", record[3], record[4])

The results of your search should give you the answers in your shell/terminal window when you run this program. Remember, as these are historical records, they can often be incomplete.

Answer

This exercise illustrates how historical and real-world data can often be incomplete. There are 4 allegations of nuisance involving forty hounds barking through the night. However, only three of the accused names are complete: Henry William, Maurice Stephens and Richard Tudor. The fourth has the Tudor surname but we do not know if this was a second accusation raised against Richard or another person with the same surname.

Full program

If you have documented and commented out each of the previous exercises as you worked through them, the code should look like this:

                    
    nutmeg_counter = 0

    for record in crime_records:
        #Exercise 1 -----
        #if "nutmeg" in record[20] or "Nutmeg" in record[20]:
            #nutmeg_counter += 1
            #print(nutmeg_counter, record[20])
        #-----
        #Exercise 2 -----
        #if "2050" in record[10]:
            #print("Crime listed against 2050:", record[20])
        #-----
        #Exercise 3 -----
        #if "pickle" in record[20]:
            #print("Full name of pickle wielder:", record[3], record[4])
        #-----
        #Exercise 4 -----
        #if "King's special pardon" in record[19]:
            #print("The King's special pardon was given to the person found guilty of:", record[20])
        #-----
        #Exercise 5 -----
        #if "Rescue" in record[20] and "gaol" in record[20]:
            #print("Occurrence of jail-break:", record[20])
        #-----
        #Exercise 6 -----
        if "forty hounds" in record[20]:
            print("Full name of one owner of forty hounds:", record[3], record[4])
        #-----

    #Exercise 1 -----
    #print("Total references to nutmeg:", nutmeg_counter)
    #-----

Counting

We've already looked at creating our own counters in Python - adding one to a variable each time a requirement is met. For example, in exercise 1 we added a counter for the number of times "nutmeg" appeared in one of the data columns.

However, what if we wanted to find out which name/crime/verdict/sentence is most common? At this stage in our Python programming skills this would involve determining all the possible answers and then doing a count for each.

Don't worry - we will not be asking you to do this. Instead, we're going to introduce a new tool which can do it all for us. This tool is called Counter, and it can be imported from 'collections', a library that is included with Python.

Example walk-through

First, we need to import this new tool - the best place to add this line is at the start of the program (where we are already importing the csv library):

                    
    from collections import Counter

We can now continue our program underneath the last exercise.

For this example, let's say we wanted to know the most common day on which crimes were allegedly committed. The first thing this new tool needs is a new list variable (called day_list) which stores all the values from one column - in this case the one labelled dydd (day).

                
    day_list = [record[11] for record in crime_records]

Notice how this new variable pulls the relevant column from each record using a compact, one-line version of the for loop we've been using for our own queries (known in Python as a list comprehension).
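
To make the link with the for loops you already know, this sketch builds the same list twice: once the long way and once on a single line. The records are made-up (id, day) pairs standing in for the full 25-column records.

```python
# Made-up records: (id, day) pairs standing in for the full records
sample_records = [("1", "12"), ("2", "3"), ("3", "12")]

# Long form: an empty list, a for loop, and append
day_list_long = []
for record in sample_records:
    day_list_long.append(record[1])

# Short form: the same steps written on one line
day_list_short = [record[1] for record in sample_records]

print(day_list_long)
print(day_list_short)
```

Both lists contain exactly the same values, so you can use whichever form you find easier to read.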

Now we can use our new tool to store a count of each option in a variable called day_count:

                
    day_list = [record[11] for record in crime_records]
    day_count = Counter(day_list)

We can now write a new for loop that retrieves each value (content) and its total number of occurrences, printing only the top result(s).

For the highest occurrence:

                
    for value, count in day_count.most_common(1):
        print(count, "of day", value)

For the top 3 results:

                
    for value, count in day_count.most_common(3):
        print(count, "of day", value)

For the top 10 results:

                
    for value, count in day_count.most_common(10):
        print(count, "of day", value)
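
To see exactly what Counter and most_common() hand back, here is a small self-contained sketch. The day values are made up for illustration and do not come from the dataset.

```python
from collections import Counter

# Made-up day values (hypothetical, not taken from the dataset)
sample_days = ["12", "3", "12", "25", "12", "3"]

sample_count = Counter(sample_days)

# most_common(2) returns a list of (value, count) pairs, highest count first
top_two = sample_count.most_common(2)
print(top_two)

for value, count in top_two:
    print(count, "of day", value)
```

Each pair holds the value first and its count second, which is why the for loop uses the two variable names value and count in that order.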

Counting

Here are some exercises designed to help you practise using our new counter tool.

Exercise 7

What is the most common first name of the accused?

Hints

You will need to create a list variable of first names.

The Counter tool can then be used to create a count of each item in the list.

A new 'for loop' is then needed to print out the most common item from our counts.

Walk-through

If you haven't already, import the Counter tool from the collections library at the start of your program - see the previous section's walk-through for how.

Create a new list variable to store all values of our enw_cyntaf column:

                            
    f_name_list = [record[3] for record in crime_records]

Use our new counter tool to provide counts for this list:

                            
    f_name_list = [record[3] for record in crime_records]
    f_name_count = Counter(f_name_list)

We then need a for loop to return the value (name) of the most common occurrence:

                            
    f_name_list = [record[3] for record in crime_records]
    f_name_count = Counter(f_name_list)

    for value, count in f_name_count.most_common(1):

In this new for loop, you will then need to write a print instruction to give us the answer in the shell/terminal window:

                            
    f_name_list = [record[3] for record in crime_records]
    f_name_count = Counter(f_name_list)

    for value, count in f_name_count.most_common(1):
        print("The most common first name is:", value)

The answer should now be produced in your shell/terminal when you run the program.

Answer

The most common first name of the accused in these records is 'John'.

Exercise 8

What are the five most common home counties for the accused?

Hints

This exercise involves identifying the correct column for the information requested.

Then, use a counting for loop to print the top 5 most common results.

Walk-through

First, you will need to create the necessary list and count variables for the sir (county) column, which is item number 8 in each record:

                            
    county_list = [record[8] for record in crime_records]
    county_count = Counter(county_list)

You can now write the for loop to print the 5 highest counts:

                            
    county_list = [record[8] for record in crime_records]
    county_count = Counter(county_list)

    for value, count in county_count.most_common(5):
        print("There are", count, "mentions of", value, "as home county for the accused")

Answer

The five most common home counties for the accused are:

Glamorgan (1948)
Brecon (1514)
Carmarthen (1251)
Denbigh (1242)
Montgomery (1237)

Exercise 9

What are the top three crime ids and what crimes do they represent?

Hints

First you will need to work out the top three most common crime ids.

Then you will need to search for these codes (one at a time) and print their matching testimonials to determine the crimes.

Walk-through

Repeat the process of creating a count and printing the results (changing the for loop so that it only prints 3 results):

                            
    crime_code_list = [record[10] for record in crime_records]
    crime_code_count = Counter(crime_code_list)

    for value, count in crime_code_count.most_common(3):
        print("There are", count, "accusations of crime_id:", value)

When we run this new program we get the top three crime_ids: 1200, 3400, and 1460. Let's start with crime_id 1200. In the original for loop that we used for searching (which you may need to reactivate by removing the hashes if you commented it out), we need to look for crime_id "1200" and print the contents of testun_1 to work out the crime.

                            
    for record in crime_records:
        #Exercises 1-6 commented out
        #Exercise 9 -----
        if "1200" in record[10]:
            print(record[20])

This will print a long list of 1339 testimonials - which all have one thing in common: the theft of sheep.

Change which crime_id you are looking for to 3400, and then 1460 to complete this exercise. You do not want to have all three printed in the shell/terminal as that becomes messy and makes it harder to differentiate between datasets.

Answer

The most common crime_id is 1200 which refers to sheep theft. The next most common is crime_id 3400 - breaking and entering. The next most common crime_id is 1460 - theft of food.

Full program

This is the complete program for this batch of exercises, excluding the CSV import.

                    
    #Exercise 7 -----
    
    f_name_list = [record[3] for record in crime_records]
    f_name_count = Counter(f_name_list)

    for value, count in f_name_count.most_common(1):
        print("The most common first name is:", value)
    
    #Exercise 8 -----

    county_list = [record[8] for record in crime_records]
    county_count = Counter(county_list)

    for value, count in county_count.most_common(5):
        print("There are", count, "mentions of", value, "as home county for the accused")

    #Exercise 9 -----

    crime_code_list = [record[10] for record in crime_records]
    crime_code_count = Counter(crime_code_list)

    for value, count in crime_code_count.most_common(3):
        print("There are", count, "accusations of crime_id:", value)

    for record in crime_records:
        if "1200" in record[10]:
            print(record[20])
        if "3400" in record[10]:
            print(record[20])
        if "1460" in record[10]:
            print(record[20])

Percentages

Now that we've practised using a counter, we can start to introduce some mathematics to our Python program to calculate percentages for us.

Let us look at how to do this to calculate the percentage of our accused with the first name of John.

We have already created the necessary list and count variables in exercise 7 for us to start with:

                
    f_name_list = [record[3] for record in crime_records]
    f_name_count = Counter(f_name_list)

We can now add a new variable to store the total number of records:

                
    total_count = len(crime_records)

This variable stores the total number of crime records, which it finds by measuring the length of the list.

We now need to tell the program to 'get' the total count for John and then use this value and the total count to calculate the percentage. In Python, the mathematical symbols are + (add), - (subtract), / (divide), and * (multiply).

                
    total_count = len(crime_records)

    percent_John = (f_name_count.get("John", 0) / total_count) * 100
    print(percent_John, "% of accused have the first name of John")

The value of 0 inside the get() function tells the program that, if it fails to find what you've requested, it should return the value 0.
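
The sketch below shows this default in action on a tiny made-up list of names. It also shows what happens if the 0 is left out: get() then returns None, which cannot be used in a calculation.

```python
from collections import Counter

sample_names = ["John", "Mary", "John"]  # made-up names for illustration
sample_count = Counter(sample_names)

found = sample_count.get("John", 0)         # "John" is in the list
missing = sample_count.get("Zebedee", 0)    # not in the list, so the default 0 is returned
missing_no_default = sample_count.get("Zebedee")  # no default given, so None is returned

print(found)
print(missing)
print(missing_no_default)
```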

To reduce the number of decimal places (in this case down to 2) for our print command we need to make the following change:

                
    total_count = len(crime_records)

    percent_John = round((f_name_count.get("John", 0) / total_count) * 100, 2)
    print(percent_John, "% of accused have the first name of John")

When you run this new program, you should get the answer of 16.54% of accused have the first name of John.
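
Here is the same arithmetic on small made-up numbers so you can check it by hand. The last line uses an f-string, an optional alternative that is not part of the walk-through; it places the value directly inside the text so there is no space before the % sign.

```python
john_count = 1   # made-up count
total_count = 3  # made-up total

raw_percent = (john_count / total_count) * 100  # many decimal places
rounded_percent = round(raw_percent, 2)         # rounded to 2 decimal places

print(raw_percent)
print(rounded_percent, "% of this made-up sample")
print(f"{rounded_percent}% of this made-up sample")
```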

Percentages

We've put together some exercises to test the new skill of implementing the necessary code to calculate percentages.

Exercise 10

What percentage of alleged crimes were thefts of sheep?

Hints

You should already know the crime_id for the theft of sheep from a previous exercise.

You also already have the necessary list and count variables for crime_id.

If you have not already, create a variable for the total number of records and then use this to calculate the percentage requested.

Print this value (remember, you can use the round() function) to find out the answer.

Walk-through

We already have the crime_id for the theft of sheep (1200), along with the list and count variables for crime_ids. The work-through in the previous section includes a total_count variable which is equal to the total number of records.

This means we now just have to bring all those values together to calculate the percentage for this exercise. The below example includes the round(valueToRound, numberOfDecimalPlaces) function to give an answer to 2 decimal places.

                            
    percent_1200 = round((crime_code_count.get("1200", 0) / total_count) * 100, 2)

Now we just need the print command to get the answer into our shell/terminal.

                            
    percent_1200 = round((crime_code_count.get("1200", 0) / total_count) * 100, 2)
    print(percent_1200, "% of crimes involved the theft of sheep")

Answer

10.62% of allegations involved the theft of sheep.

Exercise 11

What percentage of alleged crimes took place in the county of Brecon?

Hints

For this, you will need to create a new list and count for the correct column.

You already have a total count variable.

Walk-through

We need to create a new list and count variable for county of crime (column lle_sir):

                            
    county_of_crime_list = [record[15] for record in crime_records]
    county_of_crime_count = Counter(county_of_crime_list)

We already have a variable for total count so, it is time to create the variable of percentage for Brecon and print the solution - ideally using the round() function.

                            
    county_of_crime_list = [record[15] for record in crime_records]
    county_of_crime_count = Counter(county_of_crime_list)
    percent_Brecon = round((county_of_crime_count.get("Brecon", 0) / total_count) * 100, 2)
    print(percent_Brecon, "% of crimes occurring in Brecon")

Answer

13.04% of crimes occurred in the county of Brecon.

Exercise 12

Approximately, what percentage of accused were found guilty?

Hints

As with many real-world datasets, this is not as simple to answer as you'd initially believe.

You will need to investigate the use of the word guilty within the verdict column.

This is an example of where the counter tool is not the best solution and so, you may want to look at using the original search and count method.

When referring to guilty verdicts there is almost always a capital G, whilst 'Not guilty' always has a lower-case g.

Walk-through

The verdict column of our dataset is not as clear-cut as the crime id numbers or county names. There is a lot more variation than you'd expect. To see what we're dealing with we can use the counter tool to print all counts for this column by leaving the most_common() with empty brackets.

                                                        
    verdict_list = [record[18] for record in crime_records]
    verdict_count = Counter(verdict_list)
    for value, count in verdict_count.most_common():
        print("There are", count, "verdicts of", value)

When you run this code, it may take some time for it to print out all the different variations of this data. Adding an additional line to print the length of the count list (len(list_name)), lets us see just how many possible answers there are to the verdict.

                            
    print("There are", len(verdict_count), "different answers")

There are 568 variations in the wording of the verdict. This shows that the Counter we've been using for tidier columns will not work here.

Unlike earlier questions, this one uses a column that isn't tidy. The verdicts were written by many different clerks over 100 years, so the wording varies massively. Because of that, we can't rely on automated counting tools - we must decide what counts as 'Guilty' and search for it manually. This is exactly what real data scientists do: define rules, accept limitations, and work with imperfect information.

Our best approach is to create a counter variable and search for "Guilty" in the verdict column (Not guilty always uses a lower-case g whilst the Guilty verdicts are almost always capitalised). This reduces the error in a count compared to using the counter tool.

                            
        guilty_counter = 0
        for record in crime_records:
            #Previous exercises commented out
            #Exercise 12 -----
            if "Guilty" in record[18]:
                guilty_counter += 1
        print("There are", guilty_counter, "verdicts starting with 'Guilty'")

This program provides us with 3175 records that have a verdict starting with the term Guilty. We can now use this guilty_counter variable instead of the count.get() in our percentage calculation.

                            
    percent_Guilty = round((guilty_counter / total_count) * 100, 2)
    print(percent_Guilty, "% of verdicts starting with the term 'Guilty'")

This will print out the answer in the shell/terminal window.
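
Note that 'in' finds a capital-G "Guilty" anywhere in the verdict, not only at the start. The sketch below compares it with the string method startswith() on made-up verdict wording. We have not checked how much the two differ on the real data, so this is an illustration of the kind of rule you need to decide on rather than a correction to the count above.

```python
# Made-up verdict texts (hypothetical wording, for illustration only)
sample_verdicts = [
    "Guilty",
    "Guilty of petty larceny",
    "Not guilty",
    "Partial verdict. Guilty of theft only",
]

contains_counter = 0
starts_counter = 0

for verdict in sample_verdicts:
    if "Guilty" in verdict:           # capital-G "Guilty" anywhere in the text
        contains_counter += 1
    if verdict.startswith("Guilty"):  # only when the text begins with "Guilty"
        starts_counter += 1

print(contains_counter, "verdicts containing 'Guilty'")
print(starts_counter, "verdicts starting with 'Guilty'")
```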

Answer

A good approximation would be around 25%.

Full program

This is the complete program for this batch of exercises, excluding the CSV import.

                    
    #Exercise 10 -----

    total_count = len(crime_records)

    crime_code_list = [record[10] for record in crime_records]
    crime_code_count = Counter(crime_code_list)

    percent_1200 = round((crime_code_count.get("1200", 0) / total_count) * 100, 2)
    print(percent_1200, "% of crimes involved the theft of sheep")

    #Exercise 11 -----

    county_of_crime_list = [record[15] for record in crime_records]
    county_of_crime_count = Counter(county_of_crime_list)
    percent_Brecon = round((county_of_crime_count.get("Brecon", 0) / total_count) * 100, 2)
    print(percent_Brecon, "% of crimes occurring in Brecon")

    #Exercise 12 -----

    verdict_list = [record[18] for record in crime_records]
    verdict_count = Counter(verdict_list)
    #for value, count in verdict_count.most_common():
        #print("There are", count, "verdicts of", value)
    #print("There are", len(verdict_count), "different answers")
    
    guilty_counter = 0

    for record in crime_records:
        if "Guilty" in record[18]:
            guilty_counter += 1
    #print("There are", guilty_counter, "verdicts starting with 'Guilty'")

    percent_Guilty = round((guilty_counter / total_count) * 100, 2)
    print(percent_Guilty, "% of verdicts starting with the term 'Guilty'")

Sub-lists

Sometimes, to answer a query we need to create a sub-list from the dataset. This means we're creating a smaller set of records that all match a criterion. We can then use 'for loops' to explore these instead of the full list.

This allows us to study and analyse a group within the data. This example walk-through will look to answer the question "What percentage of female accused were issued the death penalty?". This single question could be answered using an 'if statement' within our existing 'for loop':

                
    if "F" in record[5] and "Death" in record[19]:

However, if you wish to re-use the female records for multiple queries, creating a sub-list once can save time, processing and coding, and reduce errors.

To create a new sub-list in which to store our female only records, we first must create a list variable:

                
    female_records = []

Then, inside a 'for loop' that goes through all the records, we add a new 'if statement':

                
    female_records = []

    for record in crime_records:
        if "F" in record[5]:

Inside this we tell our program to add the record to our female_records list:

                
    female_records = []

    for record in crime_records:
        if "F" in record[5]:
            female_records.append(record)

Now, we have a new list we can search through in a new for loop to determine death sentence count:

                
    female_records = []

    for record in crime_records:
        if "F" in record[5]:
            female_records.append(record)

    female_death_counter = 0

    for record in female_records:
        if "Death" in record[19]:
            female_death_counter += 1

    total_female_records = len(female_records)
    female_death_percent = round((female_death_counter / total_female_records) * 100, 2)
    print(female_death_percent, "% of accused women were sentenced to death")

This gives a value of 4.04%: the percentage of accused women who were issued the death sentence.
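
Because the same count-then-divide pattern comes up again in the exercises, it could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original walk-through: `percent_matching` and `make_record` are hypothetical helper names, and the sample records are made up so that the block runs on its own. The column numbers 5 and 19 follow the walk-through above.

```python
def percent_matching(records, column, term):
    """Return the percentage of records whose given column contains term."""
    if not records:          # avoid dividing by zero on an empty sub-list
        return 0.0
    matches = 0
    for record in records:
        if term in record[column]:
            matches += 1
    return round((matches / len(records)) * 100, 2)


def make_record(gender, sentence):
    """Build a made-up 20-column record with only columns 5 and 19 filled in."""
    record = [""] * 20
    record[5] = gender
    record[19] = sentence
    return record


# Made-up sample data for illustration; not taken from the dataset.
sample_records = [
    make_record("F", "Death"),
    make_record("F", "Fined"),
    make_record("M", "Death"),
    make_record("F", "Imprisoned"),
    make_record("F", "Transported"),
]

female_sample = [record for record in sample_records if "F" in record[5]]
print(percent_matching(female_sample, 19, "Death"), "% of the sample women were sentenced to death")
```

With the real data you would pass in `female_records` instead of `female_sample`.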

Sub-lists

The exercises below are designed to help you practise creating and using sub-lists.

Exercise 13

What are the three most common female first names for the accused?

Hints

The example provided in the previous section demonstrated how to create the sub-list of records for all accusations against females.

Use this sub-list to determine the name list and counts.

Walk-through

If you've not done so already, include the code provided in the example which creates the sub-list of only female accused records.

                            
    female_records = []

    for record in crime_records:
        if "F" in record[5]:
            female_records.append(record)

Now you can use this sub-list to create two variables: a list that stores the first name from every female record, and a count that uses the Counter tool on that list.

                            
    female_name_list = [record[3] for record in female_records]
    female_name_count = Counter(female_name_list)

Now we can write the 'for loop' through this count to determine the top 3 results and ask to have them printed.

                            
    for value, count in female_name_count.most_common(3):
        print("The most common female first names:", value, "with", count, "listings")

This program will now print the top three female names in your shell/terminal window.
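
If you'd like to see what `most_common()` actually hands back before running it on the full dataset, the small sketch below may help. The names are made up for illustration and are not taken from the records.

```python
from collections import Counter

# Made-up first names for illustration; not taken from the dataset.
sample_names = ["Ann", "Jane", "Ann", "Catherine", "Jane", "Ann", "Gwen"]

sample_name_count = Counter(sample_names)
top_two = sample_name_count.most_common(2)

print(top_two)  # a list of (name, count) pairs, highest count first
for value, count in top_two:
    print(value, "appears", count, "times")
```

Each item in the result is a pair, which is why the 'for loop' in the walk-through unpacks two variables, `value` and `count`.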

Answer

The most common female names of the accused in these records are: Mary, Elizabeth, and Margaret.

Exercise 14

Using a sub-list of records for crime_id 1200, approximately what percentage of these crimes were issued a transportation order?

Hints

You will need to create a sub-list of all the records for crime_id 1200. Remember: Variable names cannot begin with numbers.

Then you will need to go through these records to find references to "Transported" (and "transported") in the sentence column.

Use a variable to store your count in and then use this to calculate the percentage.

Walk-through

The first step is to create a sub-list of all the records with a crime id of 1200.

                            
    crime_1200_records = []

    for record in crime_records:
        if "1200" in record[10]:
            crime_1200_records.append(record)

We then need to look for all references to "Transported" or "transported" in the sentence column and store the count in a new variable:

                            
    transported_1200_counter = 0
    
    for record in crime_1200_records:
        if "Transported" in record[19] or "transported" in record[19]:
            transported_1200_counter += 1

To calculate the percentage, we now have a value for the number of times those accused of a 1200 crime were transported as part of their sentence. Next, we need the total number of records in our sub-set before we can proceed.

                            
    total_1200_records = len(crime_1200_records)

The code below calculates the percentage to one decimal place and prints the answer to the shell/terminal window.

                            
    transported_1200_percent = round((transported_1200_counter / total_1200_records) * 100, 1)
    print("Approximately", transported_1200_percent, "% of 1200 crimes resulted in transportation")

Answer

By searching a sub-list of records with crime code 1200 for references to transportation in the sentence column, you will get an answer of 6.7%.

Exercise 15

Compare the approximate percentage of guilty verdicts between male and female accused.

Hints

You should already have a list of all female records. Now, you will need to also get a list of male records.

Best practice would be to add an 'elif statement' to our 'if "F" statement' to detect "M".

Remember: We've already determined that the best approach to the question of a guilty verdict is to search for and count occurrences of "Guilty".

Walk-through

You already have the sub-list for female records, now you need one for male records. As this is real-world data, we need to consider any unknowns or exceptions within the data. You could write a new 'if statement' to create a male records list or add an elif to the creation of the female list as shown below.

                            
    female_records = []
    male_records = []

    for record in crime_records:
        if "F" in record[5]:
            female_records.append(record)
        elif "M" in record[5]:
            male_records.append(record)
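
To see how many records fall into neither group, one option is to add an 'else' branch that collects everything left over. The sketch below is an illustration rather than part of the original walk-through: `make_record`, the `_sample` list names, and the sample records themselves are all made up so the block runs on its own.

```python
def make_record(gender):
    """Build a made-up 20-column record with only column 5 (gender) filled in."""
    record = [""] * 20
    record[5] = gender
    return record


# Made-up sample data for illustration; not taken from the dataset.
sample_records = [make_record("F"), make_record("M"), make_record(""),
                  make_record("M"), make_record("?")]

female_sample = []
male_sample = []
other_sample = []   # anything that contains neither "F" nor "M"

for record in sample_records:
    if "F" in record[5]:
        female_sample.append(record)
    elif "M" in record[5]:
        male_sample.append(record)
    else:
        other_sample.append(record)

print(len(female_sample), "female,", len(male_sample), "male,", len(other_sample), "other or blank")
```

Printing the length of the leftover list is a quick way to judge how much the unknowns could affect a percentage.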

For this walk-through we chose to calculate the percentage for women found guilty first:

                            
    female_guilty_counter = 0

    for record in female_records:
        if "Guilty" in record[18]:
            female_guilty_counter += 1

    female_guilty_percent = round((female_guilty_counter / total_female_records) * 100, 1)

To repeat the same process with the male records you will also need to have a variable for the total number of records in the male sub-list.

                            
    male_guilty_counter = 0

    for record in male_records:
        if "Guilty" in record[18]:
            male_guilty_counter += 1

    total_male_records = len(male_records)
    male_guilty_percent = round((male_guilty_counter / total_male_records) * 100, 1)

We can now print a line comparing female guilty percentage against male.

                            
    print(female_guilty_percent, "% of women vs", male_guilty_percent, "% of men were found Guilty")

Answer

Approximately, 35% of women were found guilty vs 23% of men.

Exercise 16

Are the 5 most common crimes the same for men and women?

Hints

You already have the female and male sub-lists from the previous exercises.

You will need to use the Counter tool on both to determine the top 5 crime codes/ids.

Do the results match?

Walk-through

We already have the necessary sub-lists for this. For this walk-through we shall start with the top 5 crime codes/ids in the female set.

                            
    female_crimes_list = [record[10] for record in female_records]
    female_crimes_count = Counter(female_crimes_list)
    
    for value, count in female_crimes_count.most_common(5):
        print("The most common female crimes:", value, "with", count, "listings")

You can either take note of the results from your shell/terminal and then change the program to use the male sub-list instead, or add a section that does the same with the male data.

                            
    male_crimes_list = [record[10] for record in male_records]
    male_crimes_count = Counter(male_crimes_list)
    
    for value, count in male_crimes_count.most_common(5):
        print("The most common male crimes:", value, "with", count, "listings")

Do the results match?
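
Comparing two printed lists by eye works well for five items, but Python can also do the comparison for you. The sketch below uses sets, a tool not covered earlier in this activity, on made-up crime codes (assumptions for illustration only) and a top 2 rather than a top 5 to keep it short.

```python
from collections import Counter

# Made-up crime codes for illustration; not taken from the dataset.
sample_female_codes = ["A", "A", "B", "C", "B", "A"]
sample_male_codes = ["B", "B", "D", "A", "D", "D"]

# Keep only the codes (not the counts) from each top-2 result
female_top = [value for value, count in Counter(sample_female_codes).most_common(2)]
male_top = [value for value, count in Counter(sample_male_codes).most_common(2)]

shared = set(female_top) & set(male_top)        # codes in both top lists
only_female = set(female_top) - set(male_top)   # codes only in the female top list
only_male = set(male_top) - set(female_top)     # codes only in the male top list

print("In both:", sorted(shared))
print("Only in the female top list:", sorted(only_female))
print("Only in the male top list:", sorted(only_male))
```

Note that sets ignore order, so this tells you whether the same codes appear in both lists, not whether they are ranked the same.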

Answer

No, the top five crimes allegedly committed by women are not the same as those for men.

The most common crime codes/ids for women were: 1500, 1490, 1460, 3400, and 1540.

The most common crime codes/ids for men were: 1200, 3400, 1460, 1160, and 1260.

Feel free to explore the data further to discover which crimes these codes refer to.

Full program

This is the complete program for this batch of exercises, excluding the CSV import.

                    
    female_records = []
    male_records = []
    crime_1200_records = []

    for record in crime_records:
        if "F" in record[5]: #for exercise 13
            female_records.append(record)
        elif "M" in record[5]: #for exercise 15
            male_records.append(record)
        if "1200" in record[10]: #for exercise 14
            crime_1200_records.append(record)
    
    #Exercise 13 -----

    female_name_list = [record[3] for record in female_records]
    female_name_count = Counter(female_name_list)

    for value, count in female_name_count.most_common(3):
        print("The most common female first names:", value, "with", count, "listings")
        
    #Exercise 14 -----

    transported_1200_counter = 0

    for record in crime_1200_records:
        if "Transported" in record[19] or "transported" in record[19]:
            transported_1200_counter += 1

    total_1200_records = len(crime_1200_records)

    transported_1200_percent = round((transported_1200_counter / total_1200_records) * 100, 1)
    print("Approximately", transported_1200_percent, "% of 1200 crimes resulted in transportation")

    #Exercise 15 -----

    female_guilty_counter = 0

    for record in female_records:
        if "Guilty" in record[18]:
            female_guilty_counter += 1

    total_female_records = len(female_records)
    female_guilty_percent = round((female_guilty_counter / total_female_records) * 100, 1)

    male_guilty_counter = 0

    for record in male_records:
        if "Guilty" in record[18]:
            male_guilty_counter += 1

    total_male_records = len(male_records)
    male_guilty_percent = round((male_guilty_counter / total_male_records) * 100, 1)

    print(female_guilty_percent, "% of women vs", male_guilty_percent, "% of men were found Guilty")

    #Exercise 16 -----

    female_crimes_list = [record[10] for record in female_records]
    female_crimes_count = Counter(female_crimes_list)
    
    for value, count in female_crimes_count.most_common(5):
        print("The most common female crimes:", value, "with", count, "listings")

    male_crimes_list = [record[10] for record in male_records]
    male_crimes_count = Counter(male_crimes_list)
    
    for value, count in male_crimes_count.most_common(5):
        print("The most common male crimes:", value, "with", count, "listings")

Summary

This activity has explored several different ways to use Python to search and begin analysing data. Along the way, you've seen how real-world datasets can be inconsistent, incomplete, or unexpectedly complex - and how, in those situations, simpler approaches often work best.

Working with historical court records also highlighted another important aspect of real data: when we adapt a dataset to make it age-appropriate, we inevitably lose part of the full picture. In the filtered version used here, the most common offences are theft of sheep, breaking and entering, and theft of food. In the complete public record, however, the most frequent offences are assault, riot involving assault, and then theft of sheep.

Across these exercises, you've strengthened your understanding of how to transfer data into a CSV, retrieve information using Python, and apply additional tools and methods to answer specific questions, identify patterns, and generate simple statistics.

The dataset used in this activity is part of the public collections held by the National Library of Wales, and many more are available. If a particular topic interests you, try exploring another dataset and see what conclusions you can draw using the skills you've developed here.

We hope you've enjoyed this activity.