How to Extract Structured Information from Text with Python and Regular Expressions

Pain Point

Many people deal with large amounts of text in their daily work.

For example, scholars read a great deal of literature to find inspiration, data, and arguments.

Students read many textbooks and papers, then write their own reports or make slides.

Financial analysts comb through large numbers of news reports for clues about industry trends and the activities of target companies.

Not all text work is that fresh and interesting.

One important but tedious task is extracting structured information from large amounts of text.

Many data analysis scenarios require structured input.

For example, in "Loan or not: How to use Python and machine learning to help you make decisions?" and "How to use Python and deep neural networks to identify customers who are about to churn?", machine learning models clearly prefer structured, tabular information.

However, structured information is not always sitting there waiting for you. Often it is hidden in unstructured text produced in the past.

You may be used to reading the text yourself, picking out the key points, and copying and pasting them into a table. In principle there is nothing wrong with that. In practice it is inefficient and tiresome.

Most people do not want to do this kind of simple, repetitive, boring work.

Drag the mouse to select a span of text, press Ctrl+C, switch to the spreadsheet, find the right cell, press Ctrl+V, and repeat ...

Do too much of this and it may take a toll on your shoulders and elbows, and even on your physical and mental health.

Would you like a simpler, automated method that completes these annoying steps for you quickly?

I hope you will find the answer after reading this article.

Example

Here is an extremely simplified example of extracting information from Chinese text.

The point of keeping it so simple is to avoid spending too much time on interpreting the data.

I hope you can focus on the method and pick up new knowledge.

Suppose a high school class teacher asks the class monitor to compile where each student is headed after the college entrance examination. The class monitor investigates carefully and then turns in the following report:

张华考上了北京大学 (Zhang Hua was admitted to Peking University.)

李萍进了中等技术学校 (Li Ping entered a secondary technical school.)

韩梅梅进了百货公司 (Han Meimei joined a department store.)

……

To make the example feel familiar, and perhaps even strike a chord, I "borrowed" some content from the 1998 edition of the Xinhua Dictionary.

Feels familiar, doesn't it?

In real life a class has far more than three students, so imagine this as a long list of sentences.

But the class teacher actually has an unspoken requirement:

I want a table!

So you can imagine his expression when he sees this long list of sentences.

The class monitor is probably frustrated too:

If you wanted a table, you should have said so!

Now suppose you are the class monitor. What should you do?

The information is all in the text. But to turn it into a table, you have to find and handle the items one by one.

In fact, for a class of forty or fifty students, doing this by hand is not too hard.

But what if the amount of data is ten times, a hundred times, or even ten million times that of this example?

Would you still insist on doing it by hand?

That would be not only tedious but unrealistic.

We need a simple method that extracts the relevant information for us automatically.

The method we use here is the regular expression.

Regular Expressions

The Chinese name for "regular expression", 正则表达式, sounds mysterious at first. It is simply a translation of the English term "regular expression".

Put in plain words, it means "a form of expression that follows rules".

Sounds more down to earth, doesn't it?

But here is a quick lesson from "Pretending to Be an Expert 101":

If everyone can understand what you say, whom are you going to impress?

So, by convention, let's keep calling it a "regular expression".

Since its invention, it has made text processing far more efficient.

However, the people who use it most are not the writers, editors, scholars, and clerks who work with words every day, but ...

Programmers.

The code programmers write is text, and much of the data they handle is also in text format, with plenty of clear patterns to exploit.

Precisely because they hold this secret weapon, programmers can finish in half an hour a task that would keep other people working day and night for a whole week, and then sip coffee while waiting to go home.

Even in today's era of ubiquitous artificial intelligence, regular expressions still have many unexpected applications.

Take human-computer dialogue systems.

From news reports you may have the impression that human-computer dialogue is built entirely on knowledge graphs or deep learning.

It is not that these cool technologies play no part. But they account for only a portion of it, perhaps only a small one.

In production practice, what sits behind a large number of dialogue rules is not a mysterious, profound neural network but a pile of regular expressions.

You may wonder whether you can master such a high-end technique yourself.

The answer is:

Of course!

Regular expressions are not hard to learn.

Combined with Python in particular, they are a real productivity booster.

Let's see how regular expressions can help us pick out the "name" and "destination" information in the sample text.

Trying It Out

Please open a browser and go to the regex101 website.

You will see the following interface.

It is a great tool for experimenting with regular expressions. When I taught INFO 5731, students quickly got comfortable with regular expressions once they had mastered this tool.

Such a good tool must be expensive, right?

No, it's free. Feel free to use it.

First, switch the programming language on the left from the default, PHP, to Python.

Then paste the text to be processed into the large empty text box in the middle.

Let's try a match.

What is a match?

You write an expression, and the computer takes it as a strict order: on each line of text it carefully looks for a segment that fits the expression.

If there is, it will be highlighted.

Looking at the sample, we find that in every sentence the character 了 ("le") comes right before the person's destination.

So let's type 了 into the small text box at the top of the middle area.

As you can see, the 了 in each of the three sentences lights up.

This is the first kind of matching you have met: finding identical content by the literal meaning of the characters.
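The same literal match can be tried in Python. This is only a minimal sketch: the three sentences below are an assumed reconstruction of the Chinese sample text, and the variable names are made up for illustration.

```python
import re

# Assumed sample data: a reconstruction of the three Chinese sentences.
lines = ["张华考上了北京大学", "李萍进了中等技术学校", "韩梅梅进了百货公司"]

# A pattern made only of ordinary characters matches exactly those characters.
positions = []
for line in lines:
    m = re.search("了", line)
    # m.start() is the index where the match begins; None means no match.
    positions.append(m.start() if m else None)

print(positions)
```

Every line yields a position rather than None, which mirrors the three highlighted characters on the website.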

Because the sample text is so regular, we can treat 了 ("le") as an anchor: everything after it, up to the end of the sentence, is the destination information.

Isn't this exactly the structured information we are after?

Let's try to match the destination.

How? The characters on each line are different this time, right?

No problem. This is where the power of regular expressions shows.

You can use a dot, that is, ., to stand for any single character (by default, anything except a line break).

Letters, digits, punctuation ... even Chinese characters are covered.

Let's keep thinking. How many characters will there be?

We don't know.

In these three simple sentences alone, there are two cases: four characters or six characters.

So we cannot fix the length of the destination information.

That doesn't matter either. An asterisk (*) placed after the dot stands for the number of occurrences: anything from 0 to infinity will match.

Of course, in practice infinity never really appears.
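To see how the dot and the asterisk behave together, here is a small sketch using Python's re module. The sample sentence is an assumed reconstruction of the Chinese original, and the two extra test strings are invented for illustration.

```python
import re

# Assumed sample sentence (reconstructed Chinese original).
line = "张华考上了北京大学"

# "了" is matched literally; ".*" then takes every remaining character on the line.
rest_of_line = re.search("了.*", line).group(0)

# "*" also allows zero occurrences, so a string that ends with "了" still matches.
zero_occurrences = re.search("了.*", "走了").group(0)

# By default "." does not match a line break, so the match stops at "\n".
stops_at_newline = re.search("了.*", "了北京\n大学").group(0)

print(rest_of_line, zero_occurrences, stops_at_newline)
```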

We append .* to what we typed a moment ago, giving 了.*, and the result looks like this:

Not bad!

But the destination information and the character 了 are highlighted in the same color. Doesn't that mix them together?

We don't want that.

What should we do?

Try adding a pair of parentheses around .* (be careful not to use Chinese full-width symbols), giving 了(.*).

This time 了 is still blue, while the destination information after it turns green.

This use of parentheses is very important. It is called "grouping", and a group is the basic unit for extracting information.
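In Python, a group is read back with the match object's group method. The sketch below assumes the reconstructed Chinese sample sentences; it shows that group(0) is the whole match while group(1) is only the part inside the parentheses.

```python
import re

# Assumed sample data: a reconstruction of the three Chinese sentences.
lines = ["张华考上了北京大学", "李萍进了中等技术学校", "韩梅梅进了百货公司"]

# group(0) is the whole match, including the anchor character.
whole_match = re.search("了(.*)", lines[0]).group(0)

# group(1) is only what the first pair of parentheses captured.
destinations = [re.search("了(.*)", line).group(1) for line in lines]

print(whole_match)
print(destinations)
```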

We have finished half the task, haven't we?

Next, let's try to extract the names as well.

First we need an anchor for the name.

Look closely and you will see that every name is followed by a verb.

Students who go on to university get the verb 考 ("kao", to take an exam); students who take a job get 进 ("jin", to enter).

Let's try 考 first.

We put 考 directly in front of 了, giving 考了. But nothing matches.

Why?

Look back at the data and you will see that the word actually used is 考上 ("kao shang", to be admitted).

Of course, we could add 上 ("shang") here. But we should consider the more general case.

For example, what if the text used a different word for "admitted" that also starts with 考 but continues with another character?

A better way is to reuse the "big move" we just learned and insert .* between 考 and 了.

At this point your regular expression looks like this: 考.*了(.*)

Look, the information in the first line now matches.

However, two lines still don't match. What now?

Following the same pattern, we find that 进.*了(.*) matches the last two lines correctly.

Here comes the problem:

The expression that matches the first line cannot match the last two, and vice versa.
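The mismatch between the two expressions is easy to confirm in Python. This sketch assumes the reconstructed Chinese sample sentences and simply records, for each expression, which lines it matches.

```python
import re

# Assumed sample data: a reconstruction of the three Chinese sentences.
lines = ["张华考上了北京大学", "李萍进了中等技术学校", "韩梅梅进了百货公司"]

kao_pattern = "考.*了(.*)"  # anchored on the verb 考
jin_pattern = "进.*了(.*)"  # anchored on the verb 进

# re.search returns None when a line does not fit, so bool() gives True/False.
kao_hits = [bool(re.search(kao_pattern, line)) for line in lines]
jin_hits = [bool(re.search(jin_pattern, line)) for line in lines]

print(kao_hits)
print(jin_hits)
```

Each expression covers only the lines that contain its own verb, which is why a single, more general expression is needed.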

That is not good enough. We want a more general expression.

What should we do?

Let's look at how regular expressions express an "or" relationship.

Here we can separate the two characters with a vertical bar and enclose them in square brackets, meaning that the match succeeds if either one appears.

That is, we write the regular expression as [考|进].*了(.*). (Strictly speaking, square brackets already mean "any one of these characters", so the vertical bar inside them is just one more literal character; the expression still works for this data.)

Great, all three lines now match.
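How the vertical bar behaves inside square brackets can be checked directly in Python. The sketch below compares the expression from the text with two alternatives that are not part of the original tutorial: a plain character class, [考进], and a non-capturing alternation, (?:考|进). The sample sentences are an assumed reconstruction of the Chinese originals.

```python
import re

# Assumed sample data: a reconstruction of the three Chinese sentences.
lines = ["张华考上了北京大学", "李萍进了中等技术学校", "韩梅梅进了百货公司"]

patterns = {
    "class_with_bar": r"[考|进].*了(.*)",   # the expression used in the text
    "plain_class": r"[考进].*了(.*)",       # alternative: no bar needed in a class
    "alternation": r"(?:考|进).*了(.*)",    # alternative: "or" written with a group
}

# All three variants capture the same destinations for this data.
results = {
    name: [re.search(pattern, line).group(1) for line in lines]
    for name, pattern in patterns.items()
}

# Inside square brackets the bar is literal, so it matches a "|" in the text.
bar_in_class = re.search(r"[考|进]", "a|b") is not None
bar_in_plain = re.search(r"[考进]", "a|b") is not None

print(results)
print(bar_in_class, bar_in_plain)
```

For this sample the difference is harmless, but on text that contains a literal | the first variant could match in places the other two would not.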

With the verb and the particle 了 as the anchor in the middle, we can now safely extract the name information in front of them.

That is, we write: (.*)[考|进].*了(.*)

Note that now the name group is green and the destination group is red.

We have successfully extracted two groups of information! Time to celebrate!
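As a quick check outside the website, the two-group expression can be run on a single line in Python. This is a sketch on an assumed reconstruction of one sample sentence; groups() returns all captured groups at once, in the order of their opening parentheses.

```python
import re

pattern = r"(.*)[考|进].*了(.*)"

# Assumed sample sentence (reconstructed Chinese original).
match = re.search(pattern, "韩梅梅进了百货公司")

whole = match.group(0)       # the full matched text
name, dest = match.groups()  # group 1 and group 2 as a tuple

print(whole)
print(name, dest)
```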

However, if you show the class teacher these results, he probably won't be satisfied.

A table! I want a table!

Don't worry, now it's Python's turn.

Let's extract the data properly in Python.

Environment

I have put the accompanying source code for this article on GitHub.

Reply "regex" in the backend of my WeChat official account "nkwangshuyi" to get the link to the complete code.

If you are happy with this tutorial, please click the star in the upper right corner of the page to give it a star. Thank you!

Note the "Open in Colab" button in the center of the page. Please click it.

Google Colab will then open automatically.

I suggest you click the "Copy to Drive" button circled in red in the picture above. That saves a copy in your own Google Drive for easy use and review.

Colab provides a complete runtime environment. You only need to run the code cells in order to reproduce the results of this tutorial.

If you are not familiar with Google Colab, don't worry. I have a separate tutorial that explains its features and usage.

To understand the code more deeply, I suggest you open a brand-new notebook in Google Colab, type the code in step by step as shown below, and run it, making sure you understand what each piece means along the way.

This seemingly clumsy approach is actually an effective way to learn.

Code

First, import Python's regular expression package.

import re

Then we prepare the data. Note that, to show how general the code is, I added a line at the end that does not follow the pattern of the earlier lines, to see whether our code handles it correctly.

data = """张华考上了北京大学
李萍进了中等技术学校
韩梅梅进了百货公司
他们都有光明的前途"""

(The last line, 他们都有光明的前途, means "They all have a bright future.")

Next, it is time to write the regular expression. Do you really have to write it yourself?

Of course not.

The powerful regex101 website has already prepared it for us.

Click the button circled in red in the picture above, and the website generates starter code for you that matches the pattern you need.

You don't need to copy all of the code. One line in it is the important one; just copy and paste it into your Colab notebook.

regex = r"(.*)[考|进].*了(.*)"

This is what your regular expression looks like in Python.

We prepare an empty list to receive data.

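mylist = []

Then we write a loop.

for line in data.split('\n'):  # one sentence per line
    mysearch = re.search(regex, line)  # None if the line does not fit the pattern
    if mysearch:
        name = mysearch.group(1)  # first group: the name
        dest = mysearch.group(2)  # second group: the destination
        mylist.append((name, dest))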

Let me explain what each statement in this loop means:

data.split('\n') splits the text data into lines, so we can work on each line in turn.

mysearch = re.search(regex, line) tries to match the pattern against the line.

if mysearch: lets the program tell whether this line contains the pattern we are looking for. The last line of the text, for example, does not follow the pattern we analyzed earlier; when the program meets such a line, it simply skips it.

name = mysearch.group(1) stores the content of the first group, the names shown in green on the regex101 website, in the variable name. The next statement works the same way. Note that group numbers start from 1, in the order in which the parentheses appear in the regular expression.

mylist.append((name, dest)) stores the information extracted from this line in the list we defined earlier.

Note that if you leave out the if mysearch: check, the program will try to extract group contents from every line, including the one that does not match, and the result is an error like this:

So you see, when extracting information with regular expressions, you cannot just charge ahead blindly.
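The failure is easy to reproduce on the one line that does not fit the pattern. In this sketch (the sentence is an assumed reconstruction of the extra line added to the data), re.search returns None, and calling group on None raises an AttributeError, which is caught here so the name of the error can be inspected.

```python
import re

regex = r"(.*)[考|进].*了(.*)"

# Assumed reconstruction of the extra line that breaks the pattern.
line = "他们都有光明的前途"

result = re.search(regex, line)  # no match, so this is None

try:
    result.group(1)              # None has no group() method
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(result, error_name)
```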

Now let's look at the contents of the mylist list:

mylist

The results are as follows:

[('张华', '北京大学'), ('李萍', '中等技术学校'), ('韩梅梅', '百货公司')]

Exactly right: not one too many, not one too few.
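For reuse, the whole extraction step can be wrapped in a function. The following is one possible sketch, not the tutorial's own code: the function name extract_pairs and the named groups name and dest are hypothetical choices, the pattern is compiled once with re.compile, and the data is an assumed reconstruction of the Chinese sample text.

```python
import re

# Same expression as in the text, with named groups (an illustrative variation).
PATTERN = re.compile(r"(?P<name>.*)[考|进].*了(?P<dest>.*)")


def extract_pairs(text):
    """Return (name, destination) tuples for every line that fits the pattern."""
    pairs = []
    for line in text.split("\n"):
        m = PATTERN.search(line)
        if m:  # skip lines where search() returned None
            pairs.append((m.group("name"), m.group("dest")))
    return pairs


# Assumed sample data, including the final line that does not fit the pattern.
data = """张华考上了北京大学
李萍进了中等技术学校
韩梅梅进了百货公司
他们都有光明的前途"""

pairs = extract_pairs(data)
print(pairs)
```

Named groups make the loop body a little easier to read when an expression has more than two groups, at the cost of a longer pattern.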

Now we export it to a table. There are many ways, but the simplest is to use the Pandas data analysis package.

import pandas as pd

With just the pd.DataFrame function, we can turn the two-dimensional structure above, a list of tuples, into a data frame.

df = pd.DataFrame(mylist)

df.columns = ['name', 'destination']

Note that we have also thoughtfully renamed the column headers here.

Look at the fruits of your labor:

df

For a data frame, it only takes one line of code to convert it into Excel format:

df.to_excel("dest.xlsx", index=False)

Open the Files tab, refresh, and check the contents of the current directory:

The file dest.xlsx is the output. After downloading it, we can open it in Excel.
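If Pandas or an Excel writer is not available, the same rows can also be written as CSV with Python's standard csv module, and Excel can open that as well. This is only an alternative sketch, not part of the original workflow: the rows are the assumed extraction result, and an in-memory buffer stands in for a real file such as a hypothetical dest.csv.

```python
import csv
import io

# Assumed extraction result from the loop above.
rows = [("张华", "北京大学"), ("李萍", "中等技术学校"), ("韩梅梅", "百货公司")]

# In-memory stand-in for a file; for a real file one could use
# open("dest.csv", "w", newline="", encoding="utf-8-sig") instead.
buffer = io.StringIO()

writer = csv.writer(buffer)
writer.writerow(["name", "destination"])  # header row
writer.writerows(rows)                     # one row per (name, destination) tuple

csv_text = buffer.getvalue()
print(csv_text)
```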

Mission accomplished!

You can hand your results to the class teacher and enjoy his satisfied smile.

Summary

In this tutorial, we discussed how to extract structured information from text by exploiting regularities in its characters, using Python and regular expressions.

I hope you have picked up the following skills:

Understanding what regular expressions are for;

Using the regex101 website to try out regular expression matches and generate starter code;

Using Python to extract information in batches and export the structured data in the format you need.

Again, for an example this simple, the method above is definitely overkill, like firing a cannon at a mosquito.

However, if you need to process a large amount of data, this method will save you a considerable amount of time.

I hope you can extend what you have learned here to other cases and apply it flexibly in your own work.
